Trevor Moore and Adrian Cires
The general goal of this project is to explore the strength of different economic indicators as predictors for the general performance of the stock market. To do this, we analyzed a handful of economic indicators as input features to a handful of index funds which we use as a proxy for the performance of the stock market.
For information on the economic indicators we used please look at the links below.
Federal Funds Effective Rate: https://fred.stlouisfed.org/series/DFF
Interest Rate on Reserve Balances: https://fred.stlouisfed.org/series/IORB
Nominal Broad U.S. Dollar Index: https://fred.stlouisfed.org/series/DTWEXBGS
30-year Breakeven Inflation Rate: https://fred.stlouisfed.org/series/T30YIEM
Federal Funds Target Range - Upper Limit: https://fred.stlouisfed.org/series/DFEDTARU
Federal Funds Target Range - Lower Limit: https://fred.stlouisfed.org/series/DFEDTARL
10-Year Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity (Yield Curve): https://fred.stlouisfed.org/series/T10Y3M
Through this tutorial we will be utilizing pandas, numpy, and scipy, the APIs for which can be found below.
Pandas: https://pandas.pydata.org/docs/reference/index.html
NumPy: https://numpy.org/doc/stable/reference/index.html
SciPy: https://docs.scipy.org/doc/scipy/reference/index.html
Data Collection
We will be using a number of common data science libraries to do processing throughout this tutorial.
# Libraries for entire tutorial
import requests
import pandas as pd
import os
import matplotlib.pyplot as plt
import numpy as np
from scipy.ndimage import gaussian_filter1d
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
Here we read in the economic indicators we use to predict index fund performance.
To do so we use a pandas function called read_csv(). This will automatically parse a csv file into a data frame.
More info on read_csv(): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
# Storing the data frames of each indicator in a dictionary for easy access
indicators = {}
# read_csv will read a given csv file, ensure to properly specify the path to the csv file in your system
indicators['ffer'] = pd.read_csv('DFF.csv', parse_dates=['DATE'])
indicators['iorb'] = pd.read_csv('IORB.csv', parse_dates=['DATE'])
indicators['usdi'] = pd.read_csv('DTWEXBGS.csv', parse_dates=['DATE'])
indicators['infr'] = pd.read_csv('T30YIEM.csv', parse_dates=['DATE'])
indicators['ffru'] = pd.read_csv('DFEDTARU.csv', parse_dates=['DATE'])
indicators['ffrl'] = pd.read_csv('DFEDTARL.csv', parse_dates=['DATE'])
indicators['ycur'] = pd.read_csv('T10Y3M.csv', parse_dates=['DATE'])
Index funds are pooled investments represent the performance of the (hundreds of) companies contained within the fund. We get the data for 5 common index funds using the website alphadvantage which has data available for free. To get the data we use the get() method on the specified URLs from the website.
More on get(): https://www.w3schools.com/python/ref_requests_get.asp
# Defining the keys for the index funds and for accessing the online data fro alphadvantage
index_funds = ['SPY', 'QQQ', 'VOO', 'IVV', 'VTI']
index_dfs = {}
# ALPHA_VANTAGE_API_KEY = os.getenv('ALPHA_VANTAGE_API_KEY')
ALPHA_VANTAGE_API_KEY = 'AEFXVVJSCOY2X86E'
# Collecting the time and price data for each index fund and storing in a dictionary for easy access
for fund in index_funds:
alpha_vantage_url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={fund}&apikey={ALPHA_VANTAGE_API_KEY}&outputsize=full'
# requests.get() will grab HTML, json, XML, or other website type data from the given url
r = requests.get(alpha_vantage_url)
# need to run .json() to parse the json information from the alhpadvantage site
data = r.json()
# Extracting the data from the dictionary provided by .json()
df = pd.DataFrame.from_dict(data['Time Series (Daily)'], orient='index')
# Setting the value for the specified key as the data frame extracted
index_dfs[fund] = df
print(fund)
print(df)
SPY
1. open 2. high 3. low 4. close 5. volume
2024-05-17 528.8100 529.5200 527.3200 529.4500 59187585
2024-05-16 529.8800 531.5218 528.5400 528.6900 50244827
2024-05-15 525.8300 530.0800 525.1800 529.7800 59504897
2024-05-14 521.1100 523.8300 520.5600 523.3000 57535867
2024-05-13 522.5600 522.6700 519.7400 520.9100 36716361
... ... ... ... ... ...
1999-11-05 138.6250 139.1093 136.7812 137.8750 7431500
1999-11-04 136.7500 137.3593 135.7656 136.5312 7907500
1999-11-03 136.0000 136.3750 135.1250 135.5000 7222300
1999-11-02 135.9687 137.2500 134.5937 134.5937 6516900
1999-11-01 136.5000 137.0000 135.5625 135.5625 4006500
[6176 rows x 5 columns]
QQQ
1. open 2. high 3. low 4. close 5. volume
2024-05-17 452.1100 452.7200 449.5399 451.7600 35803719
2024-05-16 452.7100 454.6900 451.8100 451.9800 34780412
2024-05-15 448.4300 453.1500 446.9000 452.9000 41464651
2024-05-14 442.6500 446.4650 442.4600 445.9300 34478321
2024-05-13 443.9900 444.0900 441.6500 443.0800 22994192
... ... ... ... ... ...
1999-11-05 137.8000 138.4000 136.4000 136.4000 7567300
1999-11-04 135.4000 135.6000 133.6000 135.0000 10024300
1999-11-03 132.8000 134.3000 132.4000 133.5000 9376300
1999-11-02 131.5000 133.1000 130.4000 130.9000 6417400
1999-11-01 131.5000 133.1000 130.6000 130.8000 4840900
[6176 rows x 5 columns]
VOO
1. open 2. high 3. low 4. close 5. volume
2024-05-17 486.0900 486.7300 484.7000 486.6900 3274035
2024-05-16 487.0500 488.5700 485.8300 485.9700 3347381
2024-05-15 483.3400 487.2300 482.7350 486.9000 4399485
2024-05-14 478.9800 481.4700 478.4800 481.0400 3257724
2024-05-13 480.3500 480.4200 477.7300 478.7700 3108442
... ... ... ... ... ...
2010-09-15 51.3100 51.6900 51.2000 51.6500 18400
2010-09-14 51.4200 51.7400 51.1900 51.5190 118800
2010-09-13 51.4800 51.5700 51.2500 51.5300 67400
2010-09-10 50.8400 50.9300 50.6480 50.8900 17200
2010-09-09 51.2500 51.2500 50.5700 50.6600 53000
[3446 rows x 5 columns]
IVV
1. open 2. high 3. low 4. close 5. volume
2024-05-17 531.4500 532.1700 529.9450 532.1300 4234595
2024-05-16 532.5300 534.1800 531.1700 531.2300 5526537
2024-05-15 528.5100 532.7400 527.8245 532.4800 3568824
2024-05-14 523.7000 526.4250 523.1700 525.9600 3288312
2024-05-13 525.2000 525.2700 522.3800 523.5700 2566804
... ... ... ... ... ...
2000-05-25 140.0000 140.9000 137.9000 138.5000 69600
2000-05-24 137.8000 140.1000 136.7000 139.8000 400300
2000-05-23 140.2000 140.2000 137.7000 137.7000 373900
2000-05-22 140.6000 140.6000 136.8000 139.8000 1850600
2000-05-19 142.7000 142.7000 140.3000 140.7000 775500
[6037 rows x 5 columns]
VTI
1. open 2. high 3. low 4. close 5. volume
2024-05-17 262.1500 262.3000 261.2400 262.3000 2208726
2024-05-16 262.6800 263.2800 261.8600 261.9300 2444983
2024-05-15 260.8500 262.7300 260.4428 262.6400 2508701
2024-05-14 258.2300 259.6700 258.0830 259.4500 2466842
2024-05-13 259.0800 259.0800 257.5600 258.1900 1948407
... ... ... ... ... ...
2001-06-06 117.5000 117.8000 116.7000 116.8000 278500
2001-06-05 116.4000 118.0000 116.4000 117.8000 562400
2001-06-04 116.1000 116.2000 115.3000 116.1000 1018200
2001-06-01 115.1000 115.9000 114.4000 115.6000 2542200
2001-05-31 114.5000 115.5000 114.5000 114.8000 2457200
[5778 rows x 5 columns]
Data Processing
Data processing is the concept of taking raw data and turning into a form that can be used for analysis and modeling. For our purposes we will be taking our Index Fund and Economic Indicator data and formatting it such that all the data can be compiled into a single data frame with a single time frame.
For context a data frame is like an excel spreadsheet but it is implemented by a library called pandas which allows us as programs to a lot of fancy data manipulation. As you will see later when we are filling in missing data points pandas also drastically simplifies a lot of operations.
Start processing of indicators
Looking at data columns we realized they used the file name for the column and we need to change it
indicators['ffer'].head()
| DATE | DFF | |
|---|---|---|
| 0 | 1954-07-01 | 1.13 |
| 1 | 1954-07-02 | 1.25 |
| 2 | 1954-07-03 | 1.25 |
| 3 | 1954-07-04 | 1.25 |
| 4 | 1954-07-05 | 0.88 |
We changed column names to be what they represent
# Renaming the column headers to be easy to read and understand
indicators['ffer'] = indicators['ffer'].rename(columns={'DATE': 'Date', 'DFF': 'Daily Federal Funds Rate'})
indicators['iorb'] = indicators['iorb'].rename(columns={'DATE': 'Date', 'IORB': 'Interest Rate on Reserve Balances'})
indicators['usdi'] = indicators['usdi'].rename(columns={'DATE': 'Date', 'DTWEXBGS': 'Nominal Broad US Dollar Index'})
indicators['infr'] = indicators['infr'].rename(columns={'DATE': 'Date', 'T30YIEM': 'Inflation Rate'})
indicators['ffru'] = indicators['ffru'].rename(columns={'DATE': 'Date', 'DFEDTARU': 'Federal Funds Target Upper Limit'})
indicators['ffrl'] = indicators['ffrl'].rename(columns={'DATE': 'Date', 'DFEDTARL': 'Federal Funds Target Lower Limit'})
indicators['ycur'] = indicators['ycur'].rename(columns={'DATE': 'Date', 'T10Y3M': 'Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity'})
for df in indicators.values():
print(df)
Date Daily Federal Funds Rate
0 1954-07-01 1.13
1 1954-07-02 1.25
2 1954-07-03 1.25
3 1954-07-04 1.25
4 1954-07-05 0.88
... ... ...
25515 2024-05-09 5.33
25516 2024-05-10 5.33
25517 2024-05-11 5.33
25518 2024-05-12 5.33
25519 2024-05-13 5.33
[25520 rows x 2 columns]
Date Interest Rate on Reserve Balances
0 2021-07-29 0.15
1 2021-07-30 0.15
2 2021-07-31 0.15
3 2021-08-01 0.15
4 2021-08-02 0.15
... ... ...
1017 2024-05-11 5.40
1018 2024-05-12 5.40
1019 2024-05-13 5.40
1020 2024-05-14 5.40
1021 2024-05-15 5.40
[1022 rows x 2 columns]
Date Nominal Broad US Dollar Index
0 2006-01-02 101.4155
1 2006-01-03 100.7558
2 2006-01-04 100.2288
3 2006-01-05 100.2992
4 2006-01-06 100.0241
... ... ...
4785 2024-05-06 122.2529
4786 2024-05-07 122.4588
4787 2024-05-08 122.5562
4788 2024-05-09 122.5322
4789 2024-05-10 122.3898
[4790 rows x 2 columns]
Date Inflation Rate
0 2010-02-01 2.46
1 2010-03-01 2.49
2 2010-04-01 2.64
3 2010-05-01 2.46
4 2010-06-01 2.36
.. ... ...
166 2023-12-01 2.19
167 2024-01-01 2.24
168 2024-02-01 2.26
169 2024-03-01 2.27
170 2024-04-01 2.35
[171 rows x 2 columns]
Date Federal Funds Target Upper Limit
0 2008-12-16 0.25
1 2008-12-17 0.25
2 2008-12-18 0.25
3 2008-12-19 0.25
4 2008-12-20 0.25
... ... ...
5625 2024-05-11 5.50
5626 2024-05-12 5.50
5627 2024-05-13 5.50
5628 2024-05-14 5.50
5629 2024-05-15 5.50
[5630 rows x 2 columns]
Date Federal Funds Target Lower Limit
0 2008-12-16 0.00
1 2008-12-17 0.00
2 2008-12-18 0.00
3 2008-12-19 0.00
4 2008-12-20 0.00
... ... ...
5625 2024-05-11 5.25
5626 2024-05-12 5.25
5627 2024-05-13 5.25
5628 2024-05-14 5.25
5629 2024-05-15 5.25
[5630 rows x 2 columns]
Date \
0 1982-01-04
1 1982-01-05
2 1982-01-06
3 1982-01-07
4 1982-01-08
... ...
11048 2024-05-09
11049 2024-05-10
11050 2024-05-13
11051 2024-05-14
11052 2024-05-15
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 2.32
1 2.24
2 2.43
3 2.46
4 2.50
... ...
11048 -1.01
11049 -0.97
11050 -0.97
11051 -0.99
11052 -1.09
[11053 rows x 2 columns]
To simplify how we are processing data it is usually best to combine related data frames into a single data frame. For our case we will be combining all the indicator data frames into a single master dataframe.
# Converting str Dates to datetime type
for indicator_tag in indicators:
indicators[indicator_tag]['Date'] = pd.to_datetime(indicators[indicator_tag]['Date'])
Make sure all data is a number, so convert non-numbers to NaN
# Converting non numeric values to be numeric or NaN otherwise
for indicator_tag in indicators:
# Getting all the data columns (Dont want to conver the Date column to a number)
cols_to_convert = indicators[indicator_tag].columns.drop('Date')
# Setting all the data columns to have numeric values to get rid of strings
indicators[indicator_tag][cols_to_convert] = indicators[indicator_tag][cols_to_convert].apply(pd.to_numeric, errors='coerce')
Interpolate to remove NaN, these can't be used for processing.
for df in indicators.values():
print(df)
# Filling in NaN values by interpolating between the closest previous and next cell with a float value
for indicator_tag in indicators:
indicators[indicator_tag] = indicators[indicator_tag].interpolate(method='linear', limit_direction='forward', axis=0)
for df in indicators.values():
print(df)
Date Daily Federal Funds Rate
0 1954-07-01 1.13
1 1954-07-02 1.25
2 1954-07-03 1.25
3 1954-07-04 1.25
4 1954-07-05 0.88
... ... ...
25515 2024-05-09 5.33
25516 2024-05-10 5.33
25517 2024-05-11 5.33
25518 2024-05-12 5.33
25519 2024-05-13 5.33
[25520 rows x 2 columns]
Date Interest Rate on Reserve Balances
0 2021-07-29 0.15
1 2021-07-30 0.15
2 2021-07-31 0.15
3 2021-08-01 0.15
4 2021-08-02 0.15
... ... ...
1017 2024-05-11 5.40
1018 2024-05-12 5.40
1019 2024-05-13 5.40
1020 2024-05-14 5.40
1021 2024-05-15 5.40
[1022 rows x 2 columns]
Date Nominal Broad US Dollar Index
0 2006-01-02 101.4155
1 2006-01-03 100.7558
2 2006-01-04 100.2288
3 2006-01-05 100.2992
4 2006-01-06 100.0241
... ... ...
4785 2024-05-06 122.2529
4786 2024-05-07 122.4588
4787 2024-05-08 122.5562
4788 2024-05-09 122.5322
4789 2024-05-10 122.3898
[4790 rows x 2 columns]
Date Inflation Rate
0 2010-02-01 2.46
1 2010-03-01 2.49
2 2010-04-01 2.64
3 2010-05-01 2.46
4 2010-06-01 2.36
.. ... ...
166 2023-12-01 2.19
167 2024-01-01 2.24
168 2024-02-01 2.26
169 2024-03-01 2.27
170 2024-04-01 2.35
[171 rows x 2 columns]
Date Federal Funds Target Upper Limit
0 2008-12-16 0.25
1 2008-12-17 0.25
2 2008-12-18 0.25
3 2008-12-19 0.25
4 2008-12-20 0.25
... ... ...
5625 2024-05-11 5.50
5626 2024-05-12 5.50
5627 2024-05-13 5.50
5628 2024-05-14 5.50
5629 2024-05-15 5.50
[5630 rows x 2 columns]
Date Federal Funds Target Lower Limit
0 2008-12-16 0.00
1 2008-12-17 0.00
2 2008-12-18 0.00
3 2008-12-19 0.00
4 2008-12-20 0.00
... ... ...
5625 2024-05-11 5.25
5626 2024-05-12 5.25
5627 2024-05-13 5.25
5628 2024-05-14 5.25
5629 2024-05-15 5.25
[5630 rows x 2 columns]
Date \
0 1982-01-04
1 1982-01-05
2 1982-01-06
3 1982-01-07
4 1982-01-08
... ...
11048 2024-05-09
11049 2024-05-10
11050 2024-05-13
11051 2024-05-14
11052 2024-05-15
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 2.32
1 2.24
2 2.43
3 2.46
4 2.50
... ...
11048 -1.01
11049 -0.97
11050 -0.97
11051 -0.99
11052 -1.09
[11053 rows x 2 columns]
Date Daily Federal Funds Rate
0 1954-07-01 1.13
1 1954-07-02 1.25
2 1954-07-03 1.25
3 1954-07-04 1.25
4 1954-07-05 0.88
... ... ...
25515 2024-05-09 5.33
25516 2024-05-10 5.33
25517 2024-05-11 5.33
25518 2024-05-12 5.33
25519 2024-05-13 5.33
[25520 rows x 2 columns]
Date Interest Rate on Reserve Balances
0 2021-07-29 0.15
1 2021-07-30 0.15
2 2021-07-31 0.15
3 2021-08-01 0.15
4 2021-08-02 0.15
... ... ...
1017 2024-05-11 5.40
1018 2024-05-12 5.40
1019 2024-05-13 5.40
1020 2024-05-14 5.40
1021 2024-05-15 5.40
[1022 rows x 2 columns]
Date Nominal Broad US Dollar Index
0 2006-01-02 101.4155
1 2006-01-03 100.7558
2 2006-01-04 100.2288
3 2006-01-05 100.2992
4 2006-01-06 100.0241
... ... ...
4785 2024-05-06 122.2529
4786 2024-05-07 122.4588
4787 2024-05-08 122.5562
4788 2024-05-09 122.5322
4789 2024-05-10 122.3898
[4790 rows x 2 columns]
Date Inflation Rate
0 2010-02-01 2.46
1 2010-03-01 2.49
2 2010-04-01 2.64
3 2010-05-01 2.46
4 2010-06-01 2.36
.. ... ...
166 2023-12-01 2.19
167 2024-01-01 2.24
168 2024-02-01 2.26
169 2024-03-01 2.27
170 2024-04-01 2.35
[171 rows x 2 columns]
Date Federal Funds Target Upper Limit
0 2008-12-16 0.25
1 2008-12-17 0.25
2 2008-12-18 0.25
3 2008-12-19 0.25
4 2008-12-20 0.25
... ... ...
5625 2024-05-11 5.50
5626 2024-05-12 5.50
5627 2024-05-13 5.50
5628 2024-05-14 5.50
5629 2024-05-15 5.50
[5630 rows x 2 columns]
Date Federal Funds Target Lower Limit
0 2008-12-16 0.00
1 2008-12-17 0.00
2 2008-12-18 0.00
3 2008-12-19 0.00
4 2008-12-20 0.00
... ... ...
5625 2024-05-11 5.25
5626 2024-05-12 5.25
5627 2024-05-13 5.25
5628 2024-05-14 5.25
5629 2024-05-15 5.25
[5630 rows x 2 columns]
Date \
0 1982-01-04
1 1982-01-05
2 1982-01-06
3 1982-01-07
4 1982-01-08
... ...
11048 2024-05-09
11049 2024-05-10
11050 2024-05-13
11051 2024-05-14
11052 2024-05-15
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 2.32
1 2.24
2 2.43
3 2.46
4 2.50
... ...
11048 -1.01
11049 -0.97
11050 -0.97
11051 -0.99
11052 -1.09
[11053 rows x 2 columns]
Since inflation is measured by the change in the price index over time, it is not possible to calculate an analytic construct for inflation. Instead, we interpolate based on nearest neighbor to keep inflation as constant as possible throughout periods where it is unavailable.
# Filling in missing dates for the Inflation Rate in order to convert frequency from monthly to daily
date_range = []
# Creating a copy to avoid in place data manipulation issues
infr_copy = indicators['infr'].copy()
for date in infr_copy['Date']:
# Using pandas to determine the end of the current month
end_date = date + pd.offsets.MonthEnd(1)
# Creating an array of dates to add
date_range = date_range + (pd.date_range(start=date, end=end_date, freq='D').tolist())
# Creating a new data frame that contains all the new dates
all_dates = pd.DataFrame(date_range, columns=['Date'])
print(all_dates)
Date 0 2010-02-01 1 2010-02-02 2 2010-02-03 3 2010-02-04 4 2010-02-05 ... ... 5198 2024-04-26 5199 2024-04-27 5200 2024-04-28 5201 2024-04-29 5202 2024-04-30 [5203 rows x 1 columns]
# Merging the Inflation Rate data frame and Dates data frame to contain the existing Inflation data and all the
# needed dates
infr_copy = all_dates.merge(infr_copy, on='Date', how='left')
print(infr_copy)
Date Inflation Rate 0 2010-02-01 2.46 1 2010-02-02 NaN 2 2010-02-03 NaN 3 2010-02-04 NaN 4 2010-02-05 NaN ... ... ... 5198 2024-04-26 NaN 5199 2024-04-27 NaN 5200 2024-04-28 NaN 5201 2024-04-29 NaN 5202 2024-04-30 NaN [5203 rows x 2 columns]
infr_interpolated = infr_copy.copy()
# Interpolating to fill the added days with usable data, interpolated between the first of the current and next month
for index, row in infr_copy.iterrows():
if not isinstance(row['Inflation Rate'], float):
date = row['Date']
# Finding the start of the current month
start = date + pd.offsets.MonthBegin(-1)
# Finding the end of the current month
end = date + pd.offsets.MonthEnd(1)
# Calculating the number of days between the current day and the first and last day of the month
diff_start = (date - start).days
diff_end = (end - date).days
# Getting the inflation rate of the first day of the month
infr_start = infr_copy['Inflation Rate'][index - diff_start]
# Getting the inflation rate of the first day of the next month
if index + diff_end + 1 < len(infr_copy['Date']):
infr_end = infr_copy['Inflation Rate'][index + diff_end + 1]
# Calculating the interpolated inflation value
infr_interpolated.loc[index, 'Inflation Rate'] = (infr_end-infr_start)/(end-start).days*(diff_start) + infr_start
infr = row['Inflation Rate']
print(infr_interpolated)
indicators['infr'] = infr_interpolated
Date Inflation Rate 0 2010-02-01 2.460000 1 2010-02-02 2.461111 2 2010-02-03 2.462222 3 2010-02-04 2.463333 4 2010-02-05 2.464444 ... ... ... 5198 2024-04-26 2.350000 5199 2024-04-27 2.350000 5200 2024-04-28 2.350000 5201 2024-04-29 2.350000 5202 2024-04-30 2.350000 [5203 rows x 2 columns]
print(indicators['infr'])
Date Inflation Rate 0 2010-02-01 2.460000 1 2010-02-02 2.461111 2 2010-02-03 2.462222 3 2010-02-04 2.463333 4 2010-02-05 2.464444 ... ... ... 5198 2024-04-26 2.350000 5199 2024-04-27 2.350000 5200 2024-04-28 2.350000 5201 2024-04-29 2.350000 5202 2024-04-30 2.350000 [5203 rows x 2 columns]
However, if we look at the data we can see that every indicator uses a different time frame.
To start we will determine the indicator with the most recent start time and cut out all other data from the other indicators.
# Finding the starting date for each indicator
starting_dates = [df['Date'].min() for df in indicators.values()]
print(starting_dates)
[Timestamp('1954-07-01 00:00:00'), Timestamp('2021-07-29 00:00:00'), Timestamp('2006-01-02 00:00:00'), Timestamp('2010-02-01 00:00:00'), Timestamp('2008-12-16 00:00:00'), Timestamp('2008-12-16 00:00:00'), Timestamp('1982-01-04 00:00:00')]
# Finding the most recent date between the starting dates
most_recent_date = starting_dates.index(max(starting_dates))
print(starting_dates[most_recent_date])
2021-07-29 00:00:00
So we now have our most recent date and we know where to cut off the other indicators... HOWEVER! It is important to always look at the rest of your data! In this case we want to see how the earliest date compares to the other dates. If it's way earlier than the others we may be unnecessarily cutting off YEARS of data when we could instead remove the outlying indicator or fill in values.
To do this we will instead order the dates from earliest to latest.
# Ordering the start dates to make them easier to read
sorted_dates = sorted(starting_dates, reverse=True)
print(sorted_dates)
[Timestamp('2021-07-29 00:00:00'), Timestamp('2010-02-01 00:00:00'), Timestamp('2008-12-16 00:00:00'), Timestamp('2008-12-16 00:00:00'), Timestamp('2006-01-02 00:00:00'), Timestamp('1982-01-04 00:00:00'), Timestamp('1954-07-01 00:00:00')]
Our worries were justified! The 2021 date is 11 years earlier than the next date. On top of that we notice the next few dates tend to clump around the late 2000s. Because of this we will use the second earliest date to ensure we have sufficient data to work with.
# Selecting the second most recent start date
most_recent_date = sorted_dates[1]
print(most_recent_date)
2010-02-01 00:00:00
We extend the IORB data to be the lower bound of the target federal funds rate due to the differences in how monetary policy was conducted before this rate was implemented by the fed. Before the IORB rate, the fed implemented an Interest on Excess Reserves Rate (IOER rate) and Interest on Required Reserves (IORR) rate were in use, but got merged once the Fed started to use the new monetary policy tools.
# Determining what data needs to be added to IORB from the Federal Fund Rate Lower Limit
filtered_ffrl = indicators['ffrl'][indicators['ffrl']['Date'] < indicators['iorb']['Date'].min()]
filtered_ffrl = filtered_ffrl.rename(columns={'Federal Funds Target Lower Limit': 'Interest Rate on Reserve Balances'})
# Concatenating the additional data
result_iorb = pd.concat([filtered_ffrl, indicators['iorb']], ignore_index=True)
indicators['iorb'] = result_iorb
print(indicators['iorb'])
Date Interest Rate on Reserve Balances 0 2008-12-16 0.0 1 2008-12-17 0.0 2 2008-12-18 0.0 3 2008-12-19 0.0 4 2008-12-20 0.0 ... ... ... 5625 2024-05-11 5.4 5626 2024-05-12 5.4 5627 2024-05-13 5.4 5628 2024-05-14 5.4 5629 2024-05-15 5.4 [5630 rows x 2 columns]
You want your data to have aligned start points, so set your data to start at the start date for the latest-starting data point
# Cutting off data that comes before the chosen start date
for indicator_tag in indicators:
df = indicators[indicator_tag]
# Using data frame short cuts to removes dates for the Date column that come before the specified date
filtered_df = indicators[indicator_tag][indicators[indicator_tag]['Date'] >= most_recent_date]
# Resetting the index numbers since we deleted some rows
filtered_df = filtered_df.reset_index(drop=True)
indicators[indicator_tag] = filtered_df
for df in indicators.values():
print(df)
Date Daily Federal Funds Rate
0 2010-02-01 0.14
1 2010-02-02 0.14
2 2010-02-03 0.13
3 2010-02-04 0.14
4 2010-02-05 0.13
... ... ...
5211 2024-05-09 5.33
5212 2024-05-10 5.33
5213 2024-05-11 5.33
5214 2024-05-12 5.33
5215 2024-05-13 5.33
[5216 rows x 2 columns]
Date Interest Rate on Reserve Balances
0 2010-02-01 0.0
1 2010-02-02 0.0
2 2010-02-03 0.0
3 2010-02-04 0.0
4 2010-02-05 0.0
... ... ...
5213 2024-05-11 5.4
5214 2024-05-12 5.4
5215 2024-05-13 5.4
5216 2024-05-14 5.4
5217 2024-05-15 5.4
[5218 rows x 2 columns]
Date Nominal Broad US Dollar Index
0 2010-02-01 93.7321
1 2010-02-02 93.4136
2 2010-02-03 93.6268
3 2010-02-04 94.1401
4 2010-02-05 94.6218
... ... ...
3720 2024-05-06 122.2529
3721 2024-05-07 122.4588
3722 2024-05-08 122.5562
3723 2024-05-09 122.5322
3724 2024-05-10 122.3898
[3725 rows x 2 columns]
Date Inflation Rate
0 2010-02-01 2.460000
1 2010-02-02 2.461111
2 2010-02-03 2.462222
3 2010-02-04 2.463333
4 2010-02-05 2.464444
... ... ...
5198 2024-04-26 2.350000
5199 2024-04-27 2.350000
5200 2024-04-28 2.350000
5201 2024-04-29 2.350000
5202 2024-04-30 2.350000
[5203 rows x 2 columns]
Date Federal Funds Target Upper Limit
0 2010-02-01 0.25
1 2010-02-02 0.25
2 2010-02-03 0.25
3 2010-02-04 0.25
4 2010-02-05 0.25
... ... ...
5213 2024-05-11 5.50
5214 2024-05-12 5.50
5215 2024-05-13 5.50
5216 2024-05-14 5.50
5217 2024-05-15 5.50
[5218 rows x 2 columns]
Date Federal Funds Target Lower Limit
0 2010-02-01 0.00
1 2010-02-02 0.00
2 2010-02-03 0.00
3 2010-02-04 0.00
4 2010-02-05 0.00
... ... ...
5213 2024-05-11 5.25
5214 2024-05-12 5.25
5215 2024-05-13 5.25
5216 2024-05-14 5.25
5217 2024-05-15 5.25
[5218 rows x 2 columns]
Date \
0 2010-02-01
1 2010-02-02
2 2010-02-03
3 2010-02-04
4 2010-02-05
... ...
3723 2024-05-09
3724 2024-05-10
3725 2024-05-13
3726 2024-05-14
3727 2024-05-15
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 3.58
1 3.57
2 3.63
3 3.53
4 3.49
... ...
3723 -1.01
3724 -0.97
3725 -0.97
3726 -0.99
3727 -1.09
[3728 rows x 2 columns]
In a similar vein, all end dates need to be the same. So set your end date to be the end date from your earliest-ending data.
# Finding the ending dates for all the indicator data
end_dates = [df['Date'].max() for df in indicators.values()]
print(end_dates)
[Timestamp('2024-05-13 00:00:00'), Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-10 00:00:00'), Timestamp('2024-04-30 00:00:00'), Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-15 00:00:00')]
# Finding the oldest date from all the ending dates
least_recent_date = end_dates.index(min(end_dates))
print(end_dates[least_recent_date])
2024-04-30 00:00:00
# Sorting the ending dates to make them easier to read
sorted_dates = sorted(end_dates, reverse=True)
print(sorted_dates)
[Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-15 00:00:00'), Timestamp('2024-05-13 00:00:00'), Timestamp('2024-05-10 00:00:00'), Timestamp('2024-04-30 00:00:00')]
# Selecting the last date since it is the least recent
least_recent_date = sorted_dates[len(sorted_dates) - 1]
print(least_recent_date)
2024-04-30 00:00:00
# Cutting off data after the chosen ending date so all the indicators end at the same date
for indicator_tag in indicators:
df = indicators[indicator_tag]
# Using data frame short cuts to removes dates for the Date column that come after the specified date
filtered_df = indicators[indicator_tag][indicators[indicator_tag]['Date'] <= least_recent_date]
# Resetting the index numbers since we deleted some rows
filtered_df = filtered_df.reset_index(drop=True)
indicators[indicator_tag] = filtered_df
for df in indicators.values():
print(df)
Date Daily Federal Funds Rate
0 2010-02-01 0.14
1 2010-02-02 0.14
2 2010-02-03 0.13
3 2010-02-04 0.14
4 2010-02-05 0.13
... ... ...
5198 2024-04-26 5.33
5199 2024-04-27 5.33
5200 2024-04-28 5.33
5201 2024-04-29 5.33
5202 2024-04-30 5.33
[5203 rows x 2 columns]
Date Interest Rate on Reserve Balances
0 2010-02-01 0.0
1 2010-02-02 0.0
2 2010-02-03 0.0
3 2010-02-04 0.0
4 2010-02-05 0.0
... ... ...
5198 2024-04-26 5.4
5199 2024-04-27 5.4
5200 2024-04-28 5.4
5201 2024-04-29 5.4
5202 2024-04-30 5.4
[5203 rows x 2 columns]
Date Nominal Broad US Dollar Index
0 2010-02-01 93.7321
1 2010-02-02 93.4136
2 2010-02-03 93.6268
3 2010-02-04 94.1401
4 2010-02-05 94.6218
... ... ...
3712 2024-04-24 123.2398
3713 2024-04-25 123.1916
3714 2024-04-26 123.3004
3715 2024-04-29 122.9074
3716 2024-04-30 123.3446
[3717 rows x 2 columns]
Date Inflation Rate
0 2010-02-01 2.460000
1 2010-02-02 2.461111
2 2010-02-03 2.462222
3 2010-02-04 2.463333
4 2010-02-05 2.464444
... ... ...
5198 2024-04-26 2.350000
5199 2024-04-27 2.350000
5200 2024-04-28 2.350000
5201 2024-04-29 2.350000
5202 2024-04-30 2.350000
[5203 rows x 2 columns]
Date Federal Funds Target Upper Limit
0 2010-02-01 0.25
1 2010-02-02 0.25
2 2010-02-03 0.25
3 2010-02-04 0.25
4 2010-02-05 0.25
... ... ...
5198 2024-04-26 5.50
5199 2024-04-27 5.50
5200 2024-04-28 5.50
5201 2024-04-29 5.50
5202 2024-04-30 5.50
[5203 rows x 2 columns]
Date Federal Funds Target Lower Limit
0 2010-02-01 0.00
1 2010-02-02 0.00
2 2010-02-03 0.00
3 2010-02-04 0.00
4 2010-02-05 0.00
... ... ...
5198 2024-04-26 5.25
5199 2024-04-27 5.25
5200 2024-04-28 5.25
5201 2024-04-29 5.25
5202 2024-04-30 5.25
[5203 rows x 2 columns]
Date \
0 2010-02-01
1 2010-02-02
2 2010-02-03
3 2010-02-04
4 2010-02-05
... ...
3712 2024-04-24
3713 2024-04-25
3714 2024-04-26
3715 2024-04-29
3716 2024-04-30
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 3.58
1 3.57
2 3.63
3 3.53
4 3.49
... ...
3712 -0.81
3713 -0.77
3714 -0.79
3715 -0.82
3716 -0.77
[3717 rows x 2 columns]
# Removing weekend times from all indicator data
for indicator in indicators:
indicators[indicator] = indicators[indicator][indicators[indicator]['Date'].dt.dayofweek < 5]
indicators[indicator] = indicators[indicator].reset_index(drop=True)
for df in indicators.values():
print(df)
Date Daily Federal Funds Rate
0 2010-02-01 0.14
1 2010-02-02 0.14
2 2010-02-03 0.13
3 2010-02-04 0.14
4 2010-02-05 0.13
... ... ...
3712 2024-04-24 5.33
3713 2024-04-25 5.33
3714 2024-04-26 5.33
3715 2024-04-29 5.33
3716 2024-04-30 5.33
[3717 rows x 2 columns]
Date Interest Rate on Reserve Balances
0 2010-02-01 0.0
1 2010-02-02 0.0
2 2010-02-03 0.0
3 2010-02-04 0.0
4 2010-02-05 0.0
... ... ...
3712 2024-04-24 5.4
3713 2024-04-25 5.4
3714 2024-04-26 5.4
3715 2024-04-29 5.4
3716 2024-04-30 5.4
[3717 rows x 2 columns]
Date Nominal Broad US Dollar Index
0 2010-02-01 93.7321
1 2010-02-02 93.4136
2 2010-02-03 93.6268
3 2010-02-04 94.1401
4 2010-02-05 94.6218
... ... ...
3712 2024-04-24 123.2398
3713 2024-04-25 123.1916
3714 2024-04-26 123.3004
3715 2024-04-29 122.9074
3716 2024-04-30 123.3446
[3717 rows x 2 columns]
Date Inflation Rate
0 2010-02-01 2.460000
1 2010-02-02 2.461111
2 2010-02-03 2.462222
3 2010-02-04 2.463333
4 2010-02-05 2.464444
... ... ...
3712 2024-04-24 2.350000
3713 2024-04-25 2.350000
3714 2024-04-26 2.350000
3715 2024-04-29 2.350000
3716 2024-04-30 2.350000
[3717 rows x 2 columns]
Date Federal Funds Target Upper Limit
0 2010-02-01 0.25
1 2010-02-02 0.25
2 2010-02-03 0.25
3 2010-02-04 0.25
4 2010-02-05 0.25
... ... ...
3712 2024-04-24 5.50
3713 2024-04-25 5.50
3714 2024-04-26 5.50
3715 2024-04-29 5.50
3716 2024-04-30 5.50
[3717 rows x 2 columns]
Date Federal Funds Target Lower Limit
0 2010-02-01 0.00
1 2010-02-02 0.00
2 2010-02-03 0.00
3 2010-02-04 0.00
4 2010-02-05 0.00
... ... ...
3712 2024-04-24 5.25
3713 2024-04-25 5.25
3714 2024-04-26 5.25
3715 2024-04-29 5.25
3716 2024-04-30 5.25
[3717 rows x 2 columns]
Date \
0 2010-02-01
1 2010-02-02
2 2010-02-03
3 2010-02-04
4 2010-02-05
... ...
3712 2024-04-24
3713 2024-04-25
3714 2024-04-26
3715 2024-04-29
3716 2024-04-30
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 3.58
1 3.57
2 3.63
3 3.53
4 3.49
... ...
3712 -0.81
3713 -0.77
3714 -0.79
3715 -0.82
3716 -0.77
[3717 rows x 2 columns]
# Combining all the indicators into a single data frame since they have consistent Dates
indicator_df = indicators['ffer']
for indicator_tag in indicators:
if indicator_tag != 'ffer':
indicator_df = pd.merge(indicator_df, indicators[indicator_tag], on='Date')
print(indicator_df.head(30))
Date Daily Federal Funds Rate Interest Rate on Reserve Balances \
0 2010-02-01 0.14 0.0
1 2010-02-02 0.14 0.0
2 2010-02-03 0.13 0.0
3 2010-02-04 0.14 0.0
4 2010-02-05 0.13 0.0
5 2010-02-08 0.13 0.0
6 2010-02-09 0.13 0.0
7 2010-02-10 0.12 0.0
8 2010-02-11 0.12 0.0
9 2010-02-12 0.12 0.0
10 2010-02-15 0.12 0.0
11 2010-02-16 0.13 0.0
12 2010-02-17 0.12 0.0
13 2010-02-18 0.12 0.0
14 2010-02-19 0.13 0.0
15 2010-02-22 0.12 0.0
16 2010-02-23 0.12 0.0
17 2010-02-24 0.11 0.0
18 2010-02-25 0.12 0.0
19 2010-02-26 0.13 0.0
20 2010-03-01 0.14 0.0
21 2010-03-02 0.14 0.0
22 2010-03-03 0.15 0.0
23 2010-03-04 0.16 0.0
24 2010-03-05 0.17 0.0
25 2010-03-08 0.15 0.0
26 2010-03-09 0.14 0.0
27 2010-03-10 0.14 0.0
28 2010-03-11 0.15 0.0
29 2010-03-12 0.17 0.0
Nominal Broad US Dollar Index Inflation Rate \
0 93.73210 2.460000
1 93.41360 2.461111
2 93.62680 2.462222
3 94.14010 2.463333
4 94.62180 2.464444
5 94.49696 2.467778
6 94.37212 2.468889
7 94.24728 2.470000
8 94.12244 2.471111
9 93.99760 2.472222
10 93.73980 2.475556
11 93.48200 2.476667
12 93.71710 2.477778
13 93.87520 2.478889
14 93.95690 2.480000
15 93.74020 2.483333
16 93.97940 2.484444
17 93.97750 2.485556
18 94.37970 2.486667
19 93.66020 2.487778
20 93.79790 2.490000
21 93.48800 2.495000
22 92.98610 2.500000
23 93.40790 2.505000
24 93.26230 2.510000
25 93.24460 2.525000
26 93.19410 2.530000
27 93.04260 2.535000
28 93.05330 2.540000
29 92.66160 2.545000
Federal Funds Target Upper Limit Federal Funds Target Lower Limit \
0 0.25 0.0
1 0.25 0.0
2 0.25 0.0
3 0.25 0.0
4 0.25 0.0
5 0.25 0.0
6 0.25 0.0
7 0.25 0.0
8 0.25 0.0
9 0.25 0.0
10 0.25 0.0
11 0.25 0.0
12 0.25 0.0
13 0.25 0.0
14 0.25 0.0
15 0.25 0.0
16 0.25 0.0
17 0.25 0.0
18 0.25 0.0
19 0.25 0.0
20 0.25 0.0
21 0.25 0.0
22 0.25 0.0
23 0.25 0.0
24 0.25 0.0
25 0.25 0.0
26 0.25 0.0
27 0.25 0.0
28 0.25 0.0
29 0.25 0.0
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 3.580
1 3.570
2 3.630
3 3.530
4 3.490
5 3.500
6 3.550
7 3.610
8 3.620
9 3.590
10 3.575
11 3.560
12 3.640
13 3.690
14 3.670
15 3.690
16 3.570
17 3.580
18 3.510
19 3.480
20 3.480
21 3.480
22 3.490
23 3.470
24 3.540
25 3.560
26 3.550
27 3.580
28 3.570
29 3.560
Start processing of funds
We use the same general process of isolating all data into columns, ensuring dates are of type datetime and all data is numeric, and then filling in NaN cells and cutting off the start and end times.
# Moving Date information to its own column
for index_tag in index_dfs:
index_dfs[index_tag] = index_dfs[index_tag].reset_index()
for df in index_dfs.values():
print(df)
index 1. open 2. high 3. low 4. close 5. volume
0 2024-05-17 528.8100 529.5200 527.3200 529.4500 59187585
1 2024-05-16 529.8800 531.5218 528.5400 528.6900 50244827
2 2024-05-15 525.8300 530.0800 525.1800 529.7800 59504897
3 2024-05-14 521.1100 523.8300 520.5600 523.3000 57535867
4 2024-05-13 522.5600 522.6700 519.7400 520.9100 36716361
... ... ... ... ... ... ...
6171 1999-11-05 138.6250 139.1093 136.7812 137.8750 7431500
6172 1999-11-04 136.7500 137.3593 135.7656 136.5312 7907500
6173 1999-11-03 136.0000 136.3750 135.1250 135.5000 7222300
6174 1999-11-02 135.9687 137.2500 134.5937 134.5937 6516900
6175 1999-11-01 136.5000 137.0000 135.5625 135.5625 4006500
[6176 rows x 6 columns]
index 1. open 2. high 3. low 4. close 5. volume
0 2024-05-17 452.1100 452.7200 449.5399 451.7600 35803719
1 2024-05-16 452.7100 454.6900 451.8100 451.9800 34780412
2 2024-05-15 448.4300 453.1500 446.9000 452.9000 41464651
3 2024-05-14 442.6500 446.4650 442.4600 445.9300 34478321
4 2024-05-13 443.9900 444.0900 441.6500 443.0800 22994192
... ... ... ... ... ... ...
6171 1999-11-05 137.8000 138.4000 136.4000 136.4000 7567300
6172 1999-11-04 135.4000 135.6000 133.6000 135.0000 10024300
6173 1999-11-03 132.8000 134.3000 132.4000 133.5000 9376300
6174 1999-11-02 131.5000 133.1000 130.4000 130.9000 6417400
6175 1999-11-01 131.5000 133.1000 130.6000 130.8000 4840900
[6176 rows x 6 columns]
index 1. open 2. high 3. low 4. close 5. volume
0 2024-05-17 486.0900 486.7300 484.7000 486.6900 3274035
1 2024-05-16 487.0500 488.5700 485.8300 485.9700 3347381
2 2024-05-15 483.3400 487.2300 482.7350 486.9000 4399485
3 2024-05-14 478.9800 481.4700 478.4800 481.0400 3257724
4 2024-05-13 480.3500 480.4200 477.7300 478.7700 3108442
... ... ... ... ... ... ...
3441 2010-09-15 51.3100 51.6900 51.2000 51.6500 18400
3442 2010-09-14 51.4200 51.7400 51.1900 51.5190 118800
3443 2010-09-13 51.4800 51.5700 51.2500 51.5300 67400
3444 2010-09-10 50.8400 50.9300 50.6480 50.8900 17200
3445 2010-09-09 51.2500 51.2500 50.5700 50.6600 53000
[3446 rows x 6 columns]
index 1. open 2. high 3. low 4. close 5. volume
0 2024-05-17 531.4500 532.1700 529.9450 532.1300 4234595
1 2024-05-16 532.5300 534.1800 531.1700 531.2300 5526537
2 2024-05-15 528.5100 532.7400 527.8245 532.4800 3568824
3 2024-05-14 523.7000 526.4250 523.1700 525.9600 3288312
4 2024-05-13 525.2000 525.2700 522.3800 523.5700 2566804
... ... ... ... ... ... ...
6032 2000-05-25 140.0000 140.9000 137.9000 138.5000 69600
6033 2000-05-24 137.8000 140.1000 136.7000 139.8000 400300
6034 2000-05-23 140.2000 140.2000 137.7000 137.7000 373900
6035 2000-05-22 140.6000 140.6000 136.8000 139.8000 1850600
6036 2000-05-19 142.7000 142.7000 140.3000 140.7000 775500
[6037 rows x 6 columns]
index 1. open 2. high 3. low 4. close 5. volume
0 2024-05-17 262.1500 262.3000 261.2400 262.3000 2208726
1 2024-05-16 262.6800 263.2800 261.8600 261.9300 2444983
2 2024-05-15 260.8500 262.7300 260.4428 262.6400 2508701
3 2024-05-14 258.2300 259.6700 258.0830 259.4500 2466842
4 2024-05-13 259.0800 259.0800 257.5600 258.1900 1948407
... ... ... ... ... ... ...
5773 2001-06-06 117.5000 117.8000 116.7000 116.8000 278500
5774 2001-06-05 116.4000 118.0000 116.4000 117.8000 562400
5775 2001-06-04 116.1000 116.2000 115.3000 116.1000 1018200
5776 2001-06-01 115.1000 115.9000 114.4000 115.6000 2542200
5777 2001-05-31 114.5000 115.5000 114.5000 114.8000 2457200
[5778 rows x 6 columns]
# Renaming all index fund data frame columns to be readable and logical
for df in index_dfs.values():
df.rename(columns={"index": "Date", "1. open": "open", "2. high": "high", "3. low": "low", "4. close": "close", "5. volume": "volume"}, inplace=True)
for df in index_dfs.values():
print(df)
Date open high low close volume
0 2024-05-17 528.8100 529.5200 527.3200 529.4500 59187585
1 2024-05-16 529.8800 531.5218 528.5400 528.6900 50244827
2 2024-05-15 525.8300 530.0800 525.1800 529.7800 59504897
3 2024-05-14 521.1100 523.8300 520.5600 523.3000 57535867
4 2024-05-13 522.5600 522.6700 519.7400 520.9100 36716361
... ... ... ... ... ... ...
6171 1999-11-05 138.6250 139.1093 136.7812 137.8750 7431500
6172 1999-11-04 136.7500 137.3593 135.7656 136.5312 7907500
6173 1999-11-03 136.0000 136.3750 135.1250 135.5000 7222300
6174 1999-11-02 135.9687 137.2500 134.5937 134.5937 6516900
6175 1999-11-01 136.5000 137.0000 135.5625 135.5625 4006500
[6176 rows x 6 columns]
Date open high low close volume
0 2024-05-17 452.1100 452.7200 449.5399 451.7600 35803719
1 2024-05-16 452.7100 454.6900 451.8100 451.9800 34780412
2 2024-05-15 448.4300 453.1500 446.9000 452.9000 41464651
3 2024-05-14 442.6500 446.4650 442.4600 445.9300 34478321
4 2024-05-13 443.9900 444.0900 441.6500 443.0800 22994192
... ... ... ... ... ... ...
6171 1999-11-05 137.8000 138.4000 136.4000 136.4000 7567300
6172 1999-11-04 135.4000 135.6000 133.6000 135.0000 10024300
6173 1999-11-03 132.8000 134.3000 132.4000 133.5000 9376300
6174 1999-11-02 131.5000 133.1000 130.4000 130.9000 6417400
6175 1999-11-01 131.5000 133.1000 130.6000 130.8000 4840900
[6176 rows x 6 columns]
Date open high low close volume
0 2024-05-17 486.0900 486.7300 484.7000 486.6900 3274035
1 2024-05-16 487.0500 488.5700 485.8300 485.9700 3347381
2 2024-05-15 483.3400 487.2300 482.7350 486.9000 4399485
3 2024-05-14 478.9800 481.4700 478.4800 481.0400 3257724
4 2024-05-13 480.3500 480.4200 477.7300 478.7700 3108442
... ... ... ... ... ... ...
3441 2010-09-15 51.3100 51.6900 51.2000 51.6500 18400
3442 2010-09-14 51.4200 51.7400 51.1900 51.5190 118800
3443 2010-09-13 51.4800 51.5700 51.2500 51.5300 67400
3444 2010-09-10 50.8400 50.9300 50.6480 50.8900 17200
3445 2010-09-09 51.2500 51.2500 50.5700 50.6600 53000
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 531.4500 532.1700 529.9450 532.1300 4234595
1 2024-05-16 532.5300 534.1800 531.1700 531.2300 5526537
2 2024-05-15 528.5100 532.7400 527.8245 532.4800 3568824
3 2024-05-14 523.7000 526.4250 523.1700 525.9600 3288312
4 2024-05-13 525.2000 525.2700 522.3800 523.5700 2566804
... ... ... ... ... ... ...
6032 2000-05-25 140.0000 140.9000 137.9000 138.5000 69600
6033 2000-05-24 137.8000 140.1000 136.7000 139.8000 400300
6034 2000-05-23 140.2000 140.2000 137.7000 137.7000 373900
6035 2000-05-22 140.6000 140.6000 136.8000 139.8000 1850600
6036 2000-05-19 142.7000 142.7000 140.3000 140.7000 775500
[6037 rows x 6 columns]
Date open high low close volume
0 2024-05-17 262.1500 262.3000 261.2400 262.3000 2208726
1 2024-05-16 262.6800 263.2800 261.8600 261.9300 2444983
2 2024-05-15 260.8500 262.7300 260.4428 262.6400 2508701
3 2024-05-14 258.2300 259.6700 258.0830 259.4500 2466842
4 2024-05-13 259.0800 259.0800 257.5600 258.1900 1948407
... ... ... ... ... ... ...
5773 2001-06-06 117.5000 117.8000 116.7000 116.8000 278500
5774 2001-06-05 116.4000 118.0000 116.4000 117.8000 562400
5775 2001-06-04 116.1000 116.2000 115.3000 116.1000 1018200
5776 2001-06-01 115.1000 115.9000 114.4000 115.6000 2542200
5777 2001-05-31 114.5000 115.5000 114.5000 114.8000 2457200
[5778 rows x 6 columns]
# Converting str Dates to datetime type
for index_tag in index_dfs:
index_dfs[index_tag]['Date'] = pd.to_datetime(index_dfs[index_tag]['Date'])
# Converting non numerics to be numeric or NaN
for index_tag in index_dfs:
cols_to_convert = index_dfs[index_tag].columns.drop('Date')
index_dfs[index_tag][cols_to_convert] = index_dfs[index_tag][cols_to_convert].apply(pd.to_numeric, errors='coerce')
for df in index_dfs.values():
print(df)
# Filling NaN values by interpolating like with the indicator data
for index_tag in index_dfs:
index_dfs[index_tag] = index_dfs[index_tag].interpolate(method='linear', limit_direction='forward', axis=0)
for df in index_dfs.values():
print(df)
Date open high low close volume
0 2024-05-17 528.8100 529.5200 527.3200 529.4500 59187585
1 2024-05-16 529.8800 531.5218 528.5400 528.6900 50244827
2 2024-05-15 525.8300 530.0800 525.1800 529.7800 59504897
3 2024-05-14 521.1100 523.8300 520.5600 523.3000 57535867
4 2024-05-13 522.5600 522.6700 519.7400 520.9100 36716361
... ... ... ... ... ... ...
6171 1999-11-05 138.6250 139.1093 136.7812 137.8750 7431500
6172 1999-11-04 136.7500 137.3593 135.7656 136.5312 7907500
6173 1999-11-03 136.0000 136.3750 135.1250 135.5000 7222300
6174 1999-11-02 135.9687 137.2500 134.5937 134.5937 6516900
6175 1999-11-01 136.5000 137.0000 135.5625 135.5625 4006500
[6176 rows x 6 columns]
Date open high low close volume
0 2024-05-17 452.11 452.720 449.5399 451.76 35803719
1 2024-05-16 452.71 454.690 451.8100 451.98 34780412
2 2024-05-15 448.43 453.150 446.9000 452.90 41464651
3 2024-05-14 442.65 446.465 442.4600 445.93 34478321
4 2024-05-13 443.99 444.090 441.6500 443.08 22994192
... ... ... ... ... ... ...
6171 1999-11-05 137.80 138.400 136.4000 136.40 7567300
6172 1999-11-04 135.40 135.600 133.6000 135.00 10024300
6173 1999-11-03 132.80 134.300 132.4000 133.50 9376300
6174 1999-11-02 131.50 133.100 130.4000 130.90 6417400
6175 1999-11-01 131.50 133.100 130.6000 130.80 4840900
[6176 rows x 6 columns]
Date open high low close volume
0 2024-05-17 486.09 486.73 484.700 486.690 3274035
1 2024-05-16 487.05 488.57 485.830 485.970 3347381
2 2024-05-15 483.34 487.23 482.735 486.900 4399485
3 2024-05-14 478.98 481.47 478.480 481.040 3257724
4 2024-05-13 480.35 480.42 477.730 478.770 3108442
... ... ... ... ... ... ...
3441 2010-09-15 51.31 51.69 51.200 51.650 18400
3442 2010-09-14 51.42 51.74 51.190 51.519 118800
3443 2010-09-13 51.48 51.57 51.250 51.530 67400
3444 2010-09-10 50.84 50.93 50.648 50.890 17200
3445 2010-09-09 51.25 51.25 50.570 50.660 53000
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 531.45 532.170 529.9450 532.13 4234595
1 2024-05-16 532.53 534.180 531.1700 531.23 5526537
2 2024-05-15 528.51 532.740 527.8245 532.48 3568824
3 2024-05-14 523.70 526.425 523.1700 525.96 3288312
4 2024-05-13 525.20 525.270 522.3800 523.57 2566804
... ... ... ... ... ... ...
6032 2000-05-25 140.00 140.900 137.9000 138.50 69600
6033 2000-05-24 137.80 140.100 136.7000 139.80 400300
6034 2000-05-23 140.20 140.200 137.7000 137.70 373900
6035 2000-05-22 140.60 140.600 136.8000 139.80 1850600
6036 2000-05-19 142.70 142.700 140.3000 140.70 775500
[6037 rows x 6 columns]
Date open high low close volume
0 2024-05-17 262.15 262.30 261.2400 262.30 2208726
1 2024-05-16 262.68 263.28 261.8600 261.93 2444983
2 2024-05-15 260.85 262.73 260.4428 262.64 2508701
3 2024-05-14 258.23 259.67 258.0830 259.45 2466842
4 2024-05-13 259.08 259.08 257.5600 258.19 1948407
... ... ... ... ... ... ...
5773 2001-06-06 117.50 117.80 116.7000 116.80 278500
5774 2001-06-05 116.40 118.00 116.4000 117.80 562400
5775 2001-06-04 116.10 116.20 115.3000 116.10 1018200
5776 2001-06-01 115.10 115.90 114.4000 115.60 2542200
5777 2001-05-31 114.50 115.50 114.5000 114.80 2457200
[5778 rows x 6 columns]
Date open high low close volume
0 2024-05-17 528.8100 529.5200 527.3200 529.4500 59187585
1 2024-05-16 529.8800 531.5218 528.5400 528.6900 50244827
2 2024-05-15 525.8300 530.0800 525.1800 529.7800 59504897
3 2024-05-14 521.1100 523.8300 520.5600 523.3000 57535867
4 2024-05-13 522.5600 522.6700 519.7400 520.9100 36716361
... ... ... ... ... ... ...
6171 1999-11-05 138.6250 139.1093 136.7812 137.8750 7431500
6172 1999-11-04 136.7500 137.3593 135.7656 136.5312 7907500
6173 1999-11-03 136.0000 136.3750 135.1250 135.5000 7222300
6174 1999-11-02 135.9687 137.2500 134.5937 134.5937 6516900
6175 1999-11-01 136.5000 137.0000 135.5625 135.5625 4006500
[6176 rows x 6 columns]
Date open high low close volume
0 2024-05-17 452.11 452.720 449.5399 451.76 35803719
1 2024-05-16 452.71 454.690 451.8100 451.98 34780412
2 2024-05-15 448.43 453.150 446.9000 452.90 41464651
3 2024-05-14 442.65 446.465 442.4600 445.93 34478321
4 2024-05-13 443.99 444.090 441.6500 443.08 22994192
... ... ... ... ... ... ...
6171 1999-11-05 137.80 138.400 136.4000 136.40 7567300
6172 1999-11-04 135.40 135.600 133.6000 135.00 10024300
6173 1999-11-03 132.80 134.300 132.4000 133.50 9376300
6174 1999-11-02 131.50 133.100 130.4000 130.90 6417400
6175 1999-11-01 131.50 133.100 130.6000 130.80 4840900
[6176 rows x 6 columns]
Date open high low close volume
0 2024-05-17 486.09 486.73 484.700 486.690 3274035
1 2024-05-16 487.05 488.57 485.830 485.970 3347381
2 2024-05-15 483.34 487.23 482.735 486.900 4399485
3 2024-05-14 478.98 481.47 478.480 481.040 3257724
4 2024-05-13 480.35 480.42 477.730 478.770 3108442
... ... ... ... ... ... ...
3441 2010-09-15 51.31 51.69 51.200 51.650 18400
3442 2010-09-14 51.42 51.74 51.190 51.519 118800
3443 2010-09-13 51.48 51.57 51.250 51.530 67400
3444 2010-09-10 50.84 50.93 50.648 50.890 17200
3445 2010-09-09 51.25 51.25 50.570 50.660 53000
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 531.45 532.170 529.9450 532.13 4234595
1 2024-05-16 532.53 534.180 531.1700 531.23 5526537
2 2024-05-15 528.51 532.740 527.8245 532.48 3568824
3 2024-05-14 523.70 526.425 523.1700 525.96 3288312
4 2024-05-13 525.20 525.270 522.3800 523.57 2566804
... ... ... ... ... ... ...
6032 2000-05-25 140.00 140.900 137.9000 138.50 69600
6033 2000-05-24 137.80 140.100 136.7000 139.80 400300
6034 2000-05-23 140.20 140.200 137.7000 137.70 373900
6035 2000-05-22 140.60 140.600 136.8000 139.80 1850600
6036 2000-05-19 142.70 142.700 140.3000 140.70 775500
[6037 rows x 6 columns]
Date open high low close volume
0 2024-05-17 262.15 262.30 261.2400 262.30 2208726
1 2024-05-16 262.68 263.28 261.8600 261.93 2444983
2 2024-05-15 260.85 262.73 260.4428 262.64 2508701
3 2024-05-14 258.23 259.67 258.0830 259.45 2466842
4 2024-05-13 259.08 259.08 257.5600 258.19 1948407
... ... ... ... ... ... ...
5773 2001-06-06 117.50 117.80 116.7000 116.80 278500
5774 2001-06-05 116.40 118.00 116.4000 117.80 562400
5775 2001-06-04 116.10 116.20 115.3000 116.10 1018200
5776 2001-06-01 115.10 115.90 114.4000 115.60 2542200
5777 2001-05-31 114.50 115.50 114.5000 114.80 2457200
[5778 rows x 6 columns]
# Following the same process to determine the starting and ending dates to cut the index funds from
starting_dates = [df['Date'].min() for df in index_dfs.values()]
most_recent_date = starting_dates.index(max(starting_dates))
sorted_dates = sorted(starting_dates, reverse=True)
most_recent_date = sorted_dates[0]
for index_tag in index_dfs:
df = index_dfs[index_tag]
filtered_df = index_dfs[index_tag][index_dfs[index_tag]['Date'] >= most_recent_date]
filtered_df = filtered_df.reset_index(drop=True)
index_dfs[index_tag] = filtered_df
end_dates = [df['Date'].max() for df in index_dfs.values()]
least_recent_date = end_dates.index(min(end_dates))
sorted_dates = sorted(end_dates, reverse=True)
least_recent_date = sorted_dates[len(sorted_dates) - 1]
for index_tag in index_dfs:
df = index_dfs[index_tag]
filtered_df = index_dfs[index_tag][index_dfs[index_tag]['Date'] <= least_recent_date]
filtered_df = filtered_df.reset_index(drop=True)
index_dfs[index_tag] = filtered_df
for df in index_dfs.values():
print(df)
Date open high low close volume
0 2024-05-17 528.81 529.5200 527.32 529.45 59187585
1 2024-05-16 529.88 531.5218 528.54 528.69 50244827
2 2024-05-15 525.83 530.0800 525.18 529.78 59504897
3 2024-05-14 521.11 523.8300 520.56 523.30 57535867
4 2024-05-13 522.56 522.6700 519.74 520.91 36716361
... ... ... ... ... ... ...
3441 2010-09-15 112.32 113.2100 111.98 113.08 168413600
3442 2010-09-14 112.50 113.2900 112.08 112.65 209339106
3443 2010-09-13 112.58 112.9500 112.13 112.72 177874202
3444 2010-09-10 111.12 111.6100 110.87 111.48 127752166
3445 2010-09-09 111.65 111.6800 110.62 110.92 146442924
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 452.1100 452.720 449.5399 451.76 35803719
1 2024-05-16 452.7100 454.690 451.8100 451.98 34780412
2 2024-05-15 448.4300 453.150 446.9000 452.90 41464651
3 2024-05-14 442.6500 446.465 442.4600 445.93 34478321
4 2024-05-13 443.9900 444.090 441.6500 443.08 22994192
... ... ... ... ... ... ...
3441 2010-09-15 47.3500 47.800 47.1900 47.75 67108200
3442 2010-09-14 47.1700 47.700 47.0800 47.45 91405800
3443 2010-09-13 46.9500 47.370 46.9400 47.25 64523400
3444 2010-09-10 46.4900 46.670 46.2500 46.60 66995700
3445 2010-09-09 46.6076 46.700 46.3200 46.43 61476400
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 486.09 486.73 484.700 486.690 3274035
1 2024-05-16 487.05 488.57 485.830 485.970 3347381
2 2024-05-15 483.34 487.23 482.735 486.900 4399485
3 2024-05-14 478.98 481.47 478.480 481.040 3257724
4 2024-05-13 480.35 480.42 477.730 478.770 3108442
... ... ... ... ... ... ...
3441 2010-09-15 51.31 51.69 51.200 51.650 18400
3442 2010-09-14 51.42 51.74 51.190 51.519 118800
3443 2010-09-13 51.48 51.57 51.250 51.530 67400
3444 2010-09-10 50.84 50.93 50.648 50.890 17200
3445 2010-09-09 51.25 51.25 50.570 50.660 53000
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 531.45 532.1700 529.9450 532.13 4234595
1 2024-05-16 532.53 534.1800 531.1700 531.23 5526537
2 2024-05-15 528.51 532.7400 527.8245 532.48 3568824
3 2024-05-14 523.70 526.4250 523.1700 525.96 3288312
4 2024-05-13 525.20 525.2700 522.3800 523.57 2566804
... ... ... ... ... ... ...
3441 2010-09-15 112.73 113.6000 112.3900 113.49 2943400
3442 2010-09-14 112.94 113.7100 112.4800 113.07 2233200
3443 2010-09-13 112.99 113.3474 112.5500 113.16 1964400
3444 2010-09-10 111.53 112.0000 111.2600 111.93 1789100
3445 2010-09-09 112.06 112.0700 111.0200 111.30 1601600
[3446 rows x 6 columns]
Date open high low close volume
0 2024-05-17 262.15 262.30 261.2400 262.30 2208726
1 2024-05-16 262.68 263.28 261.8600 261.93 2444983
2 2024-05-15 260.85 262.73 260.4428 262.64 2508701
3 2024-05-14 258.23 259.67 258.0830 259.45 2466842
4 2024-05-13 259.08 259.08 257.5600 258.19 1948407
... ... ... ... ... ... ...
3441 2010-09-15 57.19 57.67 57.0300 57.59 1513700
3442 2010-09-14 57.30 57.72 57.0900 57.40 1262900
3443 2010-09-13 57.28 57.49 57.1200 57.41 1167500
3444 2010-09-10 56.57 56.78 56.4240 56.69 706800
3445 2010-09-09 56.78 56.90 56.2810 56.45 1433300
[3446 rows x 6 columns]
# Pulling only Date and closing price information from each index fund
for index_tag in index_dfs:
index_dfs[index_tag] = index_dfs[index_tag][['Date', 'close']]
for df in index_dfs.values():
print(df)
Date close
0 2024-05-17 529.45
1 2024-05-16 528.69
2 2024-05-15 529.78
3 2024-05-14 523.30
4 2024-05-13 520.91
... ... ...
3441 2010-09-15 113.08
3442 2010-09-14 112.65
3443 2010-09-13 112.72
3444 2010-09-10 111.48
3445 2010-09-09 110.92
[3446 rows x 2 columns]
Date close
0 2024-05-17 451.76
1 2024-05-16 451.98
2 2024-05-15 452.90
3 2024-05-14 445.93
4 2024-05-13 443.08
... ... ...
3441 2010-09-15 47.75
3442 2010-09-14 47.45
3443 2010-09-13 47.25
3444 2010-09-10 46.60
3445 2010-09-09 46.43
[3446 rows x 2 columns]
Date close
0 2024-05-17 486.690
1 2024-05-16 485.970
2 2024-05-15 486.900
3 2024-05-14 481.040
4 2024-05-13 478.770
... ... ...
3441 2010-09-15 51.650
3442 2010-09-14 51.519
3443 2010-09-13 51.530
3444 2010-09-10 50.890
3445 2010-09-09 50.660
[3446 rows x 2 columns]
Date close
0 2024-05-17 532.13
1 2024-05-16 531.23
2 2024-05-15 532.48
3 2024-05-14 525.96
4 2024-05-13 523.57
... ... ...
3441 2010-09-15 113.49
3442 2010-09-14 113.07
3443 2010-09-13 113.16
3444 2010-09-10 111.93
3445 2010-09-09 111.30
[3446 rows x 2 columns]
Date close
0 2024-05-17 262.30
1 2024-05-16 261.93
2 2024-05-15 262.64
3 2024-05-14 259.45
4 2024-05-13 258.19
... ... ...
3441 2010-09-15 57.59
3442 2010-09-14 57.40
3443 2010-09-13 57.41
3444 2010-09-10 56.69
3445 2010-09-09 56.45
[3446 rows x 2 columns]
# Renaming each index funds close column to be the name of the index fund
for index_tag in index_dfs:
df = index_dfs[index_tag].copy()
df.rename(columns={"close": index_tag}, inplace=True)
index_dfs[index_tag] = df
for df in index_dfs.values():
print(df)
Date SPY
0 2024-05-17 529.45
1 2024-05-16 528.69
2 2024-05-15 529.78
3 2024-05-14 523.30
4 2024-05-13 520.91
... ... ...
3441 2010-09-15 113.08
3442 2010-09-14 112.65
3443 2010-09-13 112.72
3444 2010-09-10 111.48
3445 2010-09-09 110.92
[3446 rows x 2 columns]
Date QQQ
0 2024-05-17 451.76
1 2024-05-16 451.98
2 2024-05-15 452.90
3 2024-05-14 445.93
4 2024-05-13 443.08
... ... ...
3441 2010-09-15 47.75
3442 2010-09-14 47.45
3443 2010-09-13 47.25
3444 2010-09-10 46.60
3445 2010-09-09 46.43
[3446 rows x 2 columns]
Date VOO
0 2024-05-17 486.690
1 2024-05-16 485.970
2 2024-05-15 486.900
3 2024-05-14 481.040
4 2024-05-13 478.770
... ... ...
3441 2010-09-15 51.650
3442 2010-09-14 51.519
3443 2010-09-13 51.530
3444 2010-09-10 50.890
3445 2010-09-09 50.660
[3446 rows x 2 columns]
Date IVV
0 2024-05-17 532.13
1 2024-05-16 531.23
2 2024-05-15 532.48
3 2024-05-14 525.96
4 2024-05-13 523.57
... ... ...
3441 2010-09-15 113.49
3442 2010-09-14 113.07
3443 2010-09-13 113.16
3444 2010-09-10 111.93
3445 2010-09-09 111.30
[3446 rows x 2 columns]
Date VTI
0 2024-05-17 262.30
1 2024-05-16 261.93
2 2024-05-15 262.64
3 2024-05-14 259.45
4 2024-05-13 258.19
... ... ...
3441 2010-09-15 57.59
3442 2010-09-14 57.40
3443 2010-09-13 57.41
3444 2010-09-10 56.69
3445 2010-09-09 56.45
[3446 rows x 2 columns]
# Combining the 5 index fund data frames into a single data frame
index_df = index_dfs['SPY']
for index_tag in index_dfs:
if index_tag != 'SPY':
index_df = pd.merge(index_df, index_dfs[index_tag], on='Date')
print(index_df)
Date SPY QQQ VOO IVV VTI 0 2024-05-17 529.45 451.76 486.690 532.13 262.30 1 2024-05-16 528.69 451.98 485.970 531.23 261.93 2 2024-05-15 529.78 452.90 486.900 532.48 262.64 3 2024-05-14 523.30 445.93 481.040 525.96 259.45 4 2024-05-13 520.91 443.08 478.770 523.57 258.19 ... ... ... ... ... ... ... 3441 2010-09-15 113.08 47.75 51.650 113.49 57.59 3442 2010-09-14 112.65 47.45 51.519 113.07 57.40 3443 2010-09-13 112.72 47.25 51.530 113.16 57.41 3444 2010-09-10 111.48 46.60 50.890 111.93 56.69 3445 2010-09-09 110.92 46.43 50.660 111.30 56.45 [3446 rows x 6 columns]
# Reversing the rows of the index fund data frame so it is in the same order as the indicator data
index_df = index_df.iloc[::-1].reset_index(drop=True)
print(index_df)
Date SPY QQQ VOO IVV VTI 0 2010-09-09 110.92 46.43 50.660 111.30 56.45 1 2010-09-10 111.48 46.60 50.890 111.93 56.69 2 2010-09-13 112.72 47.25 51.530 113.16 57.41 3 2010-09-14 112.65 47.45 51.519 113.07 57.40 4 2010-09-15 113.08 47.75 51.650 113.49 57.59 ... ... ... ... ... ... ... 3441 2024-05-13 520.91 443.08 478.770 523.57 258.19 3442 2024-05-14 523.30 445.93 481.040 525.96 259.45 3443 2024-05-15 529.78 452.90 486.900 532.48 262.64 3444 2024-05-16 528.69 451.98 485.970 531.23 261.93 3445 2024-05-17 529.45 451.76 486.690 532.13 262.30 [3446 rows x 6 columns]
print(index_df)
print(indicator_df)
Date SPY QQQ VOO IVV VTI
0 2010-09-09 110.92 46.43 50.660 111.30 56.45
1 2010-09-10 111.48 46.60 50.890 111.93 56.69
2 2010-09-13 112.72 47.25 51.530 113.16 57.41
3 2010-09-14 112.65 47.45 51.519 113.07 57.40
4 2010-09-15 113.08 47.75 51.650 113.49 57.59
... ... ... ... ... ... ...
3441 2024-05-13 520.91 443.08 478.770 523.57 258.19
3442 2024-05-14 523.30 445.93 481.040 525.96 259.45
3443 2024-05-15 529.78 452.90 486.900 532.48 262.64
3444 2024-05-16 528.69 451.98 485.970 531.23 261.93
3445 2024-05-17 529.45 451.76 486.690 532.13 262.30
[3446 rows x 6 columns]
Date Daily Federal Funds Rate Interest Rate on Reserve Balances \
0 2010-02-01 0.14 0.0
1 2010-02-02 0.14 0.0
2 2010-02-03 0.13 0.0
3 2010-02-04 0.14 0.0
4 2010-02-05 0.13 0.0
... ... ... ...
3712 2024-04-24 5.33 5.4
3713 2024-04-25 5.33 5.4
3714 2024-04-26 5.33 5.4
3715 2024-04-29 5.33 5.4
3716 2024-04-30 5.33 5.4
Nominal Broad US Dollar Index Inflation Rate \
0 93.7321 2.460000
1 93.4136 2.461111
2 93.6268 2.462222
3 94.1401 2.463333
4 94.6218 2.464444
... ... ...
3712 123.2398 2.350000
3713 123.1916 2.350000
3714 123.3004 2.350000
3715 122.9074 2.350000
3716 123.3446 2.350000
Federal Funds Target Upper Limit Federal Funds Target Lower Limit \
0 0.25 0.00
1 0.25 0.00
2 0.25 0.00
3 0.25 0.00
4 0.25 0.00
... ... ...
3712 5.50 5.25
3713 5.50 5.25
3714 5.50 5.25
3715 5.50 5.25
3716 5.50 5.25
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 3.58
1 3.57
2 3.63
3 3.53
4 3.49
... ...
3712 -0.81
3713 -0.77
3714 -0.79
3715 -0.82
3716 -0.77
[3717 rows x 8 columns]
# Using the same process as before to determine the starting and ending dates to cut
# from so the index fund and indicator data frames start and end at the same date
starting_date = pd.to_datetime(max(index_df['Date'].min(), indicator_df['Date'].min()))
ending_date = pd.to_datetime(min(index_df['Date'].max(), indicator_df['Date'].max()))
index_df = index_df[index_df['Date'] >= starting_date]
indicator_df = indicator_df[indicator_df['Date'] >= starting_date]
index_df = index_df[index_df['Date'] <= ending_date]
indicator_df = indicator_df[indicator_df['Date'] <= ending_date]
index_df = index_df.reset_index(drop=True)
indicator_df = indicator_df.reset_index(drop=True)
print(index_df)
print(indicator_df)
Date SPY QQQ VOO IVV VTI
0 2010-09-09 110.92 46.43 50.660 111.30 56.45
1 2010-09-10 111.48 46.60 50.890 111.93 56.69
2 2010-09-13 112.72 47.25 51.530 113.16 57.41
3 2010-09-14 112.65 47.45 51.519 113.07 57.40
4 2010-09-15 113.08 47.75 51.650 113.49 57.59
... ... ... ... ... ... ...
3428 2024-04-24 505.41 426.51 464.500 507.97 250.65
3429 2024-04-25 503.49 424.45 462.580 505.82 249.46
3430 2024-04-26 508.26 431.00 467.210 510.77 251.78
3431 2024-04-29 510.06 432.75 468.840 512.59 252.77
3432 2024-04-30 501.98 424.59 461.430 504.44 248.61
[3433 rows x 6 columns]
Date Daily Federal Funds Rate Interest Rate on Reserve Balances \
0 2010-09-09 0.18 0.0
1 2010-09-10 0.18 0.0
2 2010-09-13 0.18 0.0
3 2010-09-14 0.19 0.0
4 2010-09-15 0.21 0.0
... ... ... ...
3554 2024-04-24 5.33 5.4
3555 2024-04-25 5.33 5.4
3556 2024-04-26 5.33 5.4
3557 2024-04-29 5.33 5.4
3558 2024-04-30 5.33 5.4
Nominal Broad US Dollar Index Inflation Rate \
0 93.7033 2.198276
1 93.6768 2.209310
2 93.0767 2.242414
3 92.6182 2.253448
4 92.9313 2.264483
... ... ...
3554 123.2398 2.350000
3555 123.1916 2.350000
3556 123.3004 2.350000
3557 122.9074 2.350000
3558 123.3446 2.350000
Federal Funds Target Upper Limit Federal Funds Target Lower Limit \
0 0.25 0.00
1 0.25 0.00
2 0.25 0.00
3 0.25 0.00
4 0.25 0.00
... ... ...
3554 5.50 5.25
3555 5.50 5.25
3556 5.50 5.25
3557 5.50 5.25
3558 5.50 5.25
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 2.63
1 2.67
2 2.59
3 2.53
4 2.59
... ...
3554 -0.81
3555 -0.77
3556 -0.79
3557 -0.82
3558 -0.77
[3559 rows x 8 columns]
Looking at the number of rows between the index and indicator data frames we notice that the indicators have more rows. This is because the stock market will be closed on random days through the year for holidays and such so we need to remove these days as well.
# Checking for unique rows between the index fund and indicator data frames
unique_dates_df1 = set(index_df['Date']) - set(indicator_df['Date'])
unique_dates_df2 = set(indicator_df['Date']) - set(index_df['Date'])
print("Unique Dates in DF1:", unique_dates_df1)
print("Unique Dates in DF2:", unique_dates_df2)
Unique Dates in DF1: set()
Unique Dates in DF2: {Timestamp('2017-04-14 00:00:00'), Timestamp('2024-03-29 00:00:00'), Timestamp('2011-04-22 00:00:00'), Timestamp('2020-02-17 00:00:00'), Timestamp('2013-05-27 00:00:00'), Timestamp('2018-12-05 00:00:00'), Timestamp('2021-01-18 00:00:00'), Timestamp('2012-01-02 00:00:00'), Timestamp('2022-11-24 00:00:00'), Timestamp('2016-02-15 00:00:00'), Timestamp('2021-12-24 00:00:00'), Timestamp('2018-07-04 00:00:00'), Timestamp('2021-07-05 00:00:00'), Timestamp('2017-09-04 00:00:00'), Timestamp('2016-01-18 00:00:00'), Timestamp('2014-02-17 00:00:00'), Timestamp('2013-11-28 00:00:00'), Timestamp('2014-12-25 00:00:00'), Timestamp('2022-06-20 00:00:00'), Timestamp('2010-11-25 00:00:00'), Timestamp('2018-01-15 00:00:00'), Timestamp('2021-05-31 00:00:00'), Timestamp('2019-05-27 00:00:00'), Timestamp('2017-01-16 00:00:00'), Timestamp('2013-09-02 00:00:00'), Timestamp('2012-07-04 00:00:00'), Timestamp('2017-07-04 00:00:00'), Timestamp('2012-02-20 00:00:00'), Timestamp('2011-11-24 00:00:00'), Timestamp('2017-12-25 00:00:00'), Timestamp('2010-12-24 00:00:00'), Timestamp('2017-11-23 00:00:00'), Timestamp('2023-04-07 00:00:00'), Timestamp('2018-05-28 00:00:00'), Timestamp('2013-01-21 00:00:00'), Timestamp('2011-12-26 00:00:00'), Timestamp('2015-12-25 00:00:00'), Timestamp('2018-12-25 00:00:00'), Timestamp('2021-09-06 00:00:00'), Timestamp('2015-04-03 00:00:00'), Timestamp('2023-02-20 00:00:00'), Timestamp('2021-11-25 00:00:00'), Timestamp('2012-10-29 00:00:00'), Timestamp('2020-04-10 00:00:00'), Timestamp('2019-01-21 00:00:00'), Timestamp('2015-11-26 00:00:00'), Timestamp('2023-01-02 00:00:00'), Timestamp('2017-05-29 00:00:00'), Timestamp('2023-05-29 00:00:00'), Timestamp('2012-11-22 00:00:00'), Timestamp('2020-01-20 00:00:00'), Timestamp('2014-01-01 00:00:00'), Timestamp('2013-02-18 00:00:00'), Timestamp('2022-05-30 00:00:00'), Timestamp('2022-09-05 00:00:00'), Timestamp('2014-11-27 00:00:00'), Timestamp('2021-02-15 00:00:00'), Timestamp('2014-07-04 00:00:00'), Timestamp('2016-07-04 00:00:00'), Timestamp('2019-12-25 00:00:00'), Timestamp('2017-02-20 00:00:00'), Timestamp('2018-03-30 00:00:00'), Timestamp('2017-01-02 00:00:00'), Timestamp('2018-09-03 00:00:00'), Timestamp('2020-01-01 00:00:00'), Timestamp('2023-11-23 00:00:00'), Timestamp('2013-07-04 00:00:00'), Timestamp('2022-02-21 00:00:00'), Timestamp('2018-11-22 00:00:00'), Timestamp('2016-01-01 00:00:00'), Timestamp('2016-12-26 00:00:00'), Timestamp('2012-04-06 00:00:00'), Timestamp('2012-12-25 00:00:00'), Timestamp('2022-12-26 00:00:00'), Timestamp('2013-03-29 00:00:00'), Timestamp('2014-01-20 00:00:00'), Timestamp('2016-09-05 00:00:00'), Timestamp('2020-11-26 00:00:00'), Timestamp('2011-07-04 00:00:00'), Timestamp('2015-07-03 00:00:00'), Timestamp('2024-01-01 00:00:00'), Timestamp('2020-05-25 00:00:00'), Timestamp('2020-09-07 00:00:00'), Timestamp('2020-12-25 00:00:00'), Timestamp('2021-04-02 00:00:00'), Timestamp('2023-06-19 00:00:00'), Timestamp('2012-05-28 00:00:00'), Timestamp('2023-09-04 00:00:00'), Timestamp('2023-01-16 00:00:00'), Timestamp('2018-01-01 00:00:00'), Timestamp('2019-07-04 00:00:00'), Timestamp('2019-02-18 00:00:00'), Timestamp('2019-09-02 00:00:00'), Timestamp('2013-12-25 00:00:00'), Timestamp('2024-02-19 00:00:00'), Timestamp('2016-03-25 00:00:00'), Timestamp('2022-04-15 00:00:00'), Timestamp('2011-05-30 00:00:00'), Timestamp('2011-09-05 00:00:00'), Timestamp('2022-07-04 00:00:00'), Timestamp('2016-05-30 00:00:00'), Timestamp('2024-01-15 00:00:00'), Timestamp('2023-12-25 00:00:00'), Timestamp('2014-09-01 00:00:00'), Timestamp('2018-02-19 00:00:00'), Timestamp('2012-10-30 00:00:00'), Timestamp('2019-04-19 00:00:00'), Timestamp('2019-01-01 00:00:00'), Timestamp('2016-11-24 00:00:00'), Timestamp('2015-02-16 00:00:00'), Timestamp('2012-01-16 00:00:00'), Timestamp('2011-01-17 00:00:00'), Timestamp('2020-07-03 00:00:00'), Timestamp('2014-05-26 00:00:00'), Timestamp('2015-01-19 00:00:00'), Timestamp('2022-01-17 00:00:00'), Timestamp('2015-01-01 00:00:00'), Timestamp('2015-09-07 00:00:00'), Timestamp('2015-05-25 00:00:00'), Timestamp('2021-01-01 00:00:00'), Timestamp('2013-01-01 00:00:00'), Timestamp('2014-04-18 00:00:00'), Timestamp('2012-09-03 00:00:00'), Timestamp('2011-02-21 00:00:00'), Timestamp('2019-11-28 00:00:00'), Timestamp('2023-07-04 00:00:00')}
# Removing unique rows from each data frame
# isin() creates a True/False mask that when applied to the data frame tells it which cells to remove
mask_index = index_df['Date'].isin(indicator_df['Date'])
index_df = index_df[mask_index].reset_index(drop=True)
mask_indicator = indicator_df['Date'].isin(index_df['Date'])
indicator_df = indicator_df[mask_indicator].reset_index(drop=True)
print(index_df)
print(indicator_df)
Date SPY QQQ VOO IVV VTI
0 2010-09-09 110.92 46.43 50.660 111.30 56.45
1 2010-09-10 111.48 46.60 50.890 111.93 56.69
2 2010-09-13 112.72 47.25 51.530 113.16 57.41
3 2010-09-14 112.65 47.45 51.519 113.07 57.40
4 2010-09-15 113.08 47.75 51.650 113.49 57.59
... ... ... ... ... ... ...
3428 2024-04-24 505.41 426.51 464.500 507.97 250.65
3429 2024-04-25 503.49 424.45 462.580 505.82 249.46
3430 2024-04-26 508.26 431.00 467.210 510.77 251.78
3431 2024-04-29 510.06 432.75 468.840 512.59 252.77
3432 2024-04-30 501.98 424.59 461.430 504.44 248.61
[3433 rows x 6 columns]
Date Daily Federal Funds Rate Interest Rate on Reserve Balances \
0 2010-09-09 0.18 0.0
1 2010-09-10 0.18 0.0
2 2010-09-13 0.18 0.0
3 2010-09-14 0.19 0.0
4 2010-09-15 0.21 0.0
... ... ... ...
3428 2024-04-24 5.33 5.4
3429 2024-04-25 5.33 5.4
3430 2024-04-26 5.33 5.4
3431 2024-04-29 5.33 5.4
3432 2024-04-30 5.33 5.4
Nominal Broad US Dollar Index Inflation Rate \
0 93.7033 2.198276
1 93.6768 2.209310
2 93.0767 2.242414
3 92.6182 2.253448
4 92.9313 2.264483
... ... ...
3428 123.2398 2.350000
3429 123.1916 2.350000
3430 123.3004 2.350000
3431 122.9074 2.350000
3432 123.3446 2.350000
Federal Funds Target Upper Limit Federal Funds Target Lower Limit \
0 0.25 0.00
1 0.25 0.00
2 0.25 0.00
3 0.25 0.00
4 0.25 0.00
... ... ...
3428 5.50 5.25
3429 5.50 5.25
3430 5.50 5.25
3431 5.50 5.25
3432 5.50 5.25
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
0 2.63
1 2.67
2 2.59
3 2.53
4 2.59
... ...
3428 -0.81
3429 -0.77
3430 -0.79
3431 -0.82
3432 -0.77
[3433 rows x 8 columns]
Exploratory Analysis and Data Visualization
Exploratory analysis is the third step in the data science pipeline and focuses on determining parameters to understand the data. Exploratory analysis is useful for understanding how to design models based off relationships between data points. Data visualization is useful for giving an intuition for how the data moves over time, distance, etc. It can also be useful for visually determining relationships. For example, seeing that two curves roughly trend upwards at the same rate.
For our case we will be using exploratory analysis in order to determine which Economic Indicators will help us create the best predictive model for the value of our Index Funds.
With the plotting functions we will be using it will be significantly simpler to have the Date column as the index rather than its own column. We can do this using set_index which just move the column.
# Creating data frames where the index is the date to simplify graphing
pindex_df = index_df.set_index('Date')
pindicator_df = indicator_df.set_index('Date')
print(pindex_df)
print(pindicator_df)
SPY QQQ VOO IVV VTI
Date
2010-09-09 110.92 46.43 50.660 111.30 56.45
2010-09-10 111.48 46.60 50.890 111.93 56.69
2010-09-13 112.72 47.25 51.530 113.16 57.41
2010-09-14 112.65 47.45 51.519 113.07 57.40
2010-09-15 113.08 47.75 51.650 113.49 57.59
... ... ... ... ... ...
2024-04-24 505.41 426.51 464.500 507.97 250.65
2024-04-25 503.49 424.45 462.580 505.82 249.46
2024-04-26 508.26 431.00 467.210 510.77 251.78
2024-04-29 510.06 432.75 468.840 512.59 252.77
2024-04-30 501.98 424.59 461.430 504.44 248.61
[3433 rows x 5 columns]
Daily Federal Funds Rate Interest Rate on Reserve Balances \
Date
2010-09-09 0.18 0.0
2010-09-10 0.18 0.0
2010-09-13 0.18 0.0
2010-09-14 0.19 0.0
2010-09-15 0.21 0.0
... ... ...
2024-04-24 5.33 5.4
2024-04-25 5.33 5.4
2024-04-26 5.33 5.4
2024-04-29 5.33 5.4
2024-04-30 5.33 5.4
Nominal Broad US Dollar Index Inflation Rate \
Date
2010-09-09 93.7033 2.198276
2010-09-10 93.6768 2.209310
2010-09-13 93.0767 2.242414
2010-09-14 92.6182 2.253448
2010-09-15 92.9313 2.264483
... ... ...
2024-04-24 123.2398 2.350000
2024-04-25 123.1916 2.350000
2024-04-26 123.3004 2.350000
2024-04-29 122.9074 2.350000
2024-04-30 123.3446 2.350000
Federal Funds Target Upper Limit \
Date
2010-09-09 0.25
2010-09-10 0.25
2010-09-13 0.25
2010-09-14 0.25
2010-09-15 0.25
... ...
2024-04-24 5.50
2024-04-25 5.50
2024-04-26 5.50
2024-04-29 5.50
2024-04-30 5.50
Federal Funds Target Lower Limit \
Date
2010-09-09 0.00
2010-09-10 0.00
2010-09-13 0.00
2010-09-14 0.00
2010-09-15 0.00
... ...
2024-04-24 5.25
2024-04-25 5.25
2024-04-26 5.25
2024-04-29 5.25
2024-04-30 5.25
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
Date
2010-09-09 2.63
2010-09-10 2.67
2010-09-13 2.59
2010-09-14 2.53
2010-09-15 2.59
... ...
2024-04-24 -0.81
2024-04-25 -0.77
2024-04-26 -0.79
2024-04-29 -0.82
2024-04-30 -0.77
[3433 rows x 7 columns]
We will first be plotting all 5 Index Funds against the individual Economic Indicators to try and determine visual relationships. By using the matplotlib library we can call .plot on the data frames which will automatically put the data on the y-axis and the dates on the x-axis. To be able to put the data from two data frames onto a single plot we save the plot as "ax" and specify "ax=ax" in the second plot. We also set secondary_y to True so that the indicators scale equally with the index funds.
# Plotting the 5 index funds with each indicator individually to be which indicators can be discarded visually
for column in pindicator_df.columns:
ax = pindex_df.plot(title="Index Fund Value", legend=True, ylabel="Price", figsize=(20,15))
pindicator_df[column].plot(ax=ax, legend=True, secondary_y=True)
By inspection we can determine that all but the Inflation Rate and the Yield Curve have a roughly positive linear relationship as the Indicators clearly increase over time. The Yield Curve does have a negative relationship though as we can see it trends down over time while the Index Funds all trend up. The Inflation Rate however has no noticeable relationship as it appears to to start trending down and then trends upward.
In order to determine which indicators to use for our model we will be running a linear regression. However running a regression on noisy data can lead to outliers being pronounced. To avoid this we will be using a gaussian blur, which can be thought of as a moving average, in order to remove the noise from our data. We first blur the Index Funds using the built in blurring function from scipy.
# Smoothing the index fund data to reduce noise and modeling easier
sigma = 7
blur_pindex_df = pd.DataFrame(index=pindex_df.index)
for column in pindex_df.columns:
# Creating the blurred column of data
blur_pindex_df[f'Blurred {column}'] = gaussian_filter1d(pindex_df[column], sigma=sigma)
# Plotting the blurred column vs the original
# By adjusting sigma you can fine tune until the noise is removed but the general shape of the curve is the same
plt.figure(figsize=(20, 15))
ax = pindex_df[column].plot(title="Index Fund Value", legend=True, ylabel="Price")
blur_pindex_df[f'Blurred {column}'].plot(ax=ax, legend=True, secondary_y=True)
print(blur_pindex_df)
Blurred SPY Blurred QQQ Blurred VOO Blurred IVV Blurred VTI Date 2010-09-09 112.828632 47.817171 51.627469 113.400644 57.561600 2010-09-10 112.851878 47.838002 51.638719 113.424914 57.575681 2010-09-13 112.897843 47.878857 51.660881 113.472724 57.603447 2010-09-14 112.965472 47.938201 51.693285 113.542634 57.644114 2010-09-15 113.053308 48.013846 51.735016 113.632675 57.696603 ... ... ... ... ... ... 2024-04-24 505.459308 428.024694 464.543335 507.929580 250.431317 2024-04-25 505.203208 427.670252 464.308177 507.671895 250.299455 2024-04-26 505.019436 427.408407 464.139583 507.487031 250.205106 2024-04-29 504.901279 427.236115 464.031263 507.368193 250.144587 2024-04-30 504.843623 427.150769 463.978440 507.310197 250.115111 [3433 rows x 5 columns]
Next we blur the Yield Curve and US Dollar Value data since these are the only indicators which are not step functions and have noise.
# Smoothing the Yield Curve and Dollar Value data to reduce noise
blur_pindicator_df = pindicator_df.drop(columns=['Inflation Rate'])
blur_pindicator_df.index = pindicator_df.index
blur_pindicator_df['Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity'] = gaussian_filter1d(pindicator_df['Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity'], sigma=sigma)
blur_pindicator_df['Nominal Broad US Dollar Index'] = gaussian_filter1d(pindicator_df['Nominal Broad US Dollar Index'], sigma=sigma)
Lastly before creating our regressions we want to average out our index fund data since they follow similar trends. This will help simplify the regression. Before we compute the average however, we normalize all the Index Fund data between 0 and 1 in order to ensure higher value index funds do not overwhelm lower valued funds.
# Normalizing the index fund data so higher valued funds do not swamp the mean
norm_blur_pindex_df = (blur_pindex_df - blur_pindex_df.min()) / (blur_pindex_df.max() - blur_pindex_df.min())
print(norm_blur_pindex_df)
Blurred SPY Blurred QQQ Blurred VOO Blurred IVV Blurred VTI Date 2010-09-09 0.000000 0.000000 0.000000 0.000000 0.000000 2010-09-10 0.000057 0.000053 0.000027 0.000060 0.000071 2010-09-13 0.000171 0.000157 0.000079 0.000177 0.000210 2010-09-14 0.000338 0.000307 0.000155 0.000349 0.000414 2010-09-15 0.000555 0.000500 0.000253 0.000570 0.000677 ... ... ... ... ... ... 2024-04-24 0.970360 0.965979 0.973182 0.969736 0.967235 2024-04-25 0.969727 0.965078 0.972628 0.969102 0.966574 2024-04-26 0.969273 0.964413 0.972231 0.968648 0.966100 2024-04-29 0.968981 0.963975 0.971976 0.968356 0.965797 2024-04-30 0.968838 0.963758 0.971851 0.968213 0.965649 [3433 rows x 5 columns]
# Pooling the index fund data by taking a mean of the rows
row_means = norm_blur_pindex_df.mean(axis=1)
mean_pindex_df = pd.DataFrame(row_means, columns=['Index Fund Means'])
mean_pindex_df.index = norm_blur_pindex_df.index
print(mean_pindex_df)
Index Fund Means Date 2010-09-09 0.000000 2010-09-10 0.000053 2010-09-13 0.000159 2010-09-14 0.000313 2010-09-15 0.000511 ... ... 2024-04-24 0.969298 2024-04-25 0.968622 2024-04-26 0.968133 2024-04-29 0.967817 2024-04-30 0.967662 [3433 rows x 1 columns]
Now we run our regression. A linear regression calculates an optimized y-intercept and slope in order to approximate a data set using a linear curve. This allows to do things. One is we know have a simple model to make predictions about the future. Two is we can get an idea of how related two data sets are using the $R^2$ value. The $R^2$ value will be between 0 and 1 and the closer 1 it is the more related we can assume those two factors are. In our case we will be creating 1 regression per Economic Indicator and calculating the associated $R^2$ value to choose which Indicators to use in our machine learning model.
Since the regressions take the form y = mx+b we have to calculate two values, m and b. Typically these are called $\beta_1$ and $\beta_0$ rather than m and b respectively.
To calculate these two values we use the formulas below.
$$ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2} $$$$ \beta_0 = \overline{y} - \beta_1 \overline{x} $$To calculate the $R^2$ value we use
$$ R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \overline{y})^2} $$To simplify our code we will be using the built in LinearRegression model from scipy to compute these values.
Regression Models: https://en.wikipedia.org/wiki/Linear_regression
$R^2$: https://en.wikipedia.org/wiki/Coefficient_of_determination
# Creating linear regressions for each indicator against the pooled index fund data to determine
# the best predictors for the machine learning model
model = LinearRegression()
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(12, 12))
axes = axes.flatten()
for i, column in enumerate(blur_pindicator_df.columns):
# Prepare x and y data to plot
X = blur_pindicator_df[[column]] # Predictor
y = mean_pindex_df['Index Fund Means'] # Response
# Fit linear regression model
model.fit(X, y)
# Make predicted linear curve based off model
y_pred = model.predict(X)
# Calculate R^2 value to get an idea of accuracy
r_squared = model.score(X, y)
# Plotting data and a linear regression line
ax = axes[i]
ax.scatter(X, y, color='blue', label='Data')
ax.plot(X, y_pred, color='red', label=f'Linear Reg Line, $R^2={r_squared:.2f}$') # Include R^2 in the legend
ax.set_title(f'Linear Regression of Index Fund Means vs. {column}')
ax.set_xlabel(column)
ax.set_ylabel('Normalized Index Fund Means')
ax.legend()
plt.tight_layout()
plt.show()
Modeling
Modeling is the idea of making predictions about the future or missing information from the past based off information we have right now. This can be done by drawing relationships between various data sets like what we did with the regression or by creating a model that we can train on existing data.
Looking at our 6 regression models we see the two best $R^2$ values are the Yield Curve and US Dollar Value and thus we will be using the two Indicators for our machine learning model
Before creating our model we will create training and testing data. Training data is data used to improve the quality of the model by giving it inputs and comparing the predictions the model makes with known outputs and then adjusting the model based on the inaccuracy of the model. Testing data is existing data we use to "test" the accuracy of our model. The distinction between the two types of data is extremely important as the model is given the correct answers for the training data while it has to figure out the right answer to the test data.
# Creating a master_df to make training and testing data sampling easier
master_df = pd.merge(index_df, indicator_df)
master_df = master_df.set_index('Date')
print(master_df)
SPY QQQ VOO IVV VTI Daily Federal Funds Rate \
Date
2010-09-09 110.92 46.43 50.660 111.30 56.45 0.18
2010-09-10 111.48 46.60 50.890 111.93 56.69 0.18
2010-09-13 112.72 47.25 51.530 113.16 57.41 0.18
2010-09-14 112.65 47.45 51.519 113.07 57.40 0.19
2010-09-15 113.08 47.75 51.650 113.49 57.59 0.21
... ... ... ... ... ... ...
2024-04-24 505.41 426.51 464.500 507.97 250.65 5.33
2024-04-25 503.49 424.45 462.580 505.82 249.46 5.33
2024-04-26 508.26 431.00 467.210 510.77 251.78 5.33
2024-04-29 510.06 432.75 468.840 512.59 252.77 5.33
2024-04-30 501.98 424.59 461.430 504.44 248.61 5.33
Interest Rate on Reserve Balances Nominal Broad US Dollar Index \
Date
2010-09-09 0.0 93.7033
2010-09-10 0.0 93.6768
2010-09-13 0.0 93.0767
2010-09-14 0.0 92.6182
2010-09-15 0.0 92.9313
... ... ...
2024-04-24 5.4 123.2398
2024-04-25 5.4 123.1916
2024-04-26 5.4 123.3004
2024-04-29 5.4 122.9074
2024-04-30 5.4 123.3446
Inflation Rate Federal Funds Target Upper Limit \
Date
2010-09-09 2.198276 0.25
2010-09-10 2.209310 0.25
2010-09-13 2.242414 0.25
2010-09-14 2.253448 0.25
2010-09-15 2.264483 0.25
... ... ...
2024-04-24 2.350000 5.50
2024-04-25 2.350000 5.50
2024-04-26 2.350000 5.50
2024-04-29 2.350000 5.50
2024-04-30 2.350000 5.50
Federal Funds Target Lower Limit \
Date
2010-09-09 0.00
2010-09-10 0.00
2010-09-13 0.00
2010-09-14 0.00
2010-09-15 0.00
... ...
2024-04-24 5.25
2024-04-25 5.25
2024-04-26 5.25
2024-04-29 5.25
2024-04-30 5.25
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
Date
2010-09-09 2.63
2010-09-10 2.67
2010-09-13 2.59
2010-09-14 2.53
2010-09-15 2.59
... ...
2024-04-24 -0.81
2024-04-25 -0.77
2024-04-26 -0.79
2024-04-29 -0.82
2024-04-30 -0.77
[3433 rows x 12 columns]
# Creating training and testing data for the model
train_df = master_df.sample(n=len(master_df)//2)
test_df = master_df.drop(train_df.index)
print(train_df.head(30))
SPY QQQ VOO IVV VTI Daily Federal Funds Rate \
Date
2018-03-02 269.08 165.99 247.36 271.20 138.24 1.42
2017-12-15 266.51 157.65 246.12 269.79 137.55 1.41
2019-01-14 257.40 159.27 236.58 259.11 132.00 2.40
2013-01-23 149.37 67.59 68.37 150.14 76.99 0.13
2019-12-02 311.64 203.00 286.30 313.41 158.57 1.56
2023-08-31 450.35 377.99 413.83 452.69 223.94 5.33
2020-05-18 295.00 227.43 271.08 296.04 148.60 0.05
2022-09-22 374.22 280.07 345.37 377.69 188.29 3.08
2012-06-04 128.10 60.87 58.63 128.60 65.54 0.17
2023-10-23 420.46 355.67 386.31 422.38 207.90 5.33
2012-11-27 140.33 64.96 64.29 140.93 71.91 0.16
2019-06-10 288.97 183.15 265.39 291.00 147.61 2.37
2017-11-15 256.44 152.59 235.53 258.18 131.69 1.16
2013-07-02 161.21 71.74 73.78 161.95 83.27 0.10
2015-05-15 212.44 109.58 194.73 213.91 109.94 0.13
2014-11-26 207.64 105.52 190.44 209.04 106.98 0.10
2016-04-25 208.61 108.98 191.35 209.66 106.54 0.37
2019-04-26 293.41 190.65 269.50 295.47 150.16 2.44
2023-05-24 411.09 331.65 377.63 412.84 204.26 5.08
2020-08-21 339.48 281.87 311.93 340.85 172.22 0.09
2016-10-31 212.55 116.99 195.13 213.69 108.89 0.31
2020-12-23 367.57 308.20 337.81 368.88 193.28 0.09
2011-05-25 132.39 56.79 60.55 132.81 68.47 0.09
2021-02-25 382.33 312.83 351.52 383.78 200.88 0.07
2015-09-17 199.70 107.13 183.09 200.90 103.48 0.14
2021-07-07 434.46 360.95 399.37 436.31 225.08 0.10
2019-09-06 298.05 191.59 273.74 299.83 151.68 2.12
2021-03-11 393.53 318.04 361.85 394.99 206.99 0.07
2013-01-31 149.70 66.87 68.56 150.46 77.25 0.15
2012-01-24 131.46 59.68 60.11 131.91 67.54 0.09
Interest Rate on Reserve Balances Nominal Broad US Dollar Index \
Date
2018-03-02 1.25 108.44870
2017-12-15 1.25 111.12860
2019-01-14 2.25 114.18785
2013-01-23 0.00 90.83760
2019-12-02 1.50 116.96640
2023-08-31 5.40 120.58820
2020-05-18 0.00 122.80550
2022-09-22 3.15 126.16680
2012-06-04 0.00 94.51840
2023-10-23 5.40 123.59350
2012-11-27 0.00 91.26080
2019-06-10 2.25 115.50120
2017-11-15 1.00 111.48090
2013-07-02 0.00 94.40510
2015-05-15 0.00 104.02020
2014-11-26 0.00 99.18930
2016-04-25 0.25 110.18650
2019-04-26 2.25 115.47490
2023-05-24 5.15 120.64810
2020-08-21 0.00 117.06820
2016-10-31 0.25 114.01880
2020-12-23 0.00 111.98730
2011-05-25 0.00 87.61360
2021-02-25 0.00 111.94020
2015-09-17 0.00 110.12120
2021-07-07 0.00 113.11360
2019-09-06 2.00 117.13050
2021-03-11 0.00 112.93450
2013-01-31 0.00 90.79800
2012-01-24 0.00 91.18100
Inflation Rate Federal Funds Target Upper Limit \
Date
2018-03-02 2.101333 1.50
2017-12-15 2.021333 1.50
2019-01-14 1.880333 2.50
2013-01-23 2.600000 0.25
2019-12-02 1.780333 1.75
2023-08-31 2.390000 5.50
2020-05-18 1.511000 0.25
2022-09-22 2.313448 3.25
2012-06-04 2.200000 0.25
2023-10-23 2.418667 5.50
2012-11-27 2.539655 0.25
2019-06-10 1.786207 2.50
2017-11-15 1.949310 1.25
2013-07-02 2.271667 0.25
2015-05-15 1.990667 0.25
2014-11-26 1.955172 0.25
2016-04-25 1.768276 0.50
2019-04-26 1.909655 2.50
2023-05-24 2.237000 5.25
2020-08-21 1.743333 0.25
2016-10-31 1.940000 0.50
2020-12-23 2.073333 0.25
2011-05-25 2.464000 0.25
2021-02-25 2.220000 0.25
2015-09-17 1.687931 0.25
2021-07-07 2.230000 0.25
2019-09-06 1.648276 2.25
2021-03-11 2.236667 0.25
2013-01-31 2.568966 0.25
2012-01-24 2.366667 0.25
Federal Funds Target Lower Limit \
Date
2018-03-02 1.25
2017-12-15 1.25
2019-01-14 2.25
2013-01-23 0.00
2019-12-02 1.50
2023-08-31 5.25
2020-05-18 0.00
2022-09-22 3.00
2012-06-04 0.00
2023-10-23 5.25
2012-11-27 0.00
2019-06-10 2.25
2017-11-15 1.00
2013-07-02 0.00
2015-05-15 0.00
2014-11-26 0.00
2016-04-25 0.25
2019-04-26 2.25
2023-05-24 5.00
2020-08-21 0.00
2016-10-31 0.25
2020-12-23 0.00
2011-05-25 0.00
2021-02-25 0.00
2015-09-17 0.00
2021-07-07 0.00
2019-09-06 2.00
2021-03-11 0.00
2013-01-31 0.00
2012-01-24 0.00
Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity
Date
2018-03-02 1.21
2017-12-15 1.04
2019-01-14 0.26
2013-01-23 1.78
2019-12-02 0.23
2023-08-31 -1.47
2020-05-18 0.60
2022-09-22 0.41
2012-06-04 1.45
2023-10-23 -0.72
2012-11-27 1.54
2019-06-10 -0.14
2017-11-15 1.08
2013-07-02 2.45
2015-05-15 2.12
2014-11-26 2.22
2016-04-25 1.66
2019-04-26 0.09
2023-05-24 -1.64
2020-08-21 0.54
2016-10-31 1.50
2020-12-23 0.87
2011-05-25 3.07
2021-02-25 1.50
2015-09-17 2.20
2021-07-07 1.28
2019-09-06 -0.41
2021-03-11 1.50
2013-01-31 1.95
2012-01-24 2.04
# Creating the quadratic training data
train_df['Daily Federal Funds Rate^2'] = train_df['Daily Federal Funds Rate'] ** 2
train_df['Interest Rate on Reserve Balances^2'] = train_df['Interest Rate on Reserve Balances'] ** 2
train_df['Nominal Broad US Dollar Index^2'] = train_df['Nominal Broad US Dollar Index'] ** 2
train_df['Inflation Rate^2'] = train_df['Inflation Rate'] ** 2
train_df['Federal Funds Target Upper Limit^2'] = train_df['Federal Funds Target Upper Limit'] ** 2
train_df['Federal Funds Target Lower Limit^2'] = train_df['Federal Funds Target Lower Limit'] ** 2
train_df['Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity^2'] = train_df['Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity'] ** 2
# Creating the quadratic testing data
test_df['Daily Federal Funds Rate^2'] = test_df['Daily Federal Funds Rate'] ** 2
test_df['Interest Rate on Reserve Balances^2'] = test_df['Interest Rate on Reserve Balances'] ** 2
test_df['Nominal Broad US Dollar Index^2'] = test_df['Nominal Broad US Dollar Index'] ** 2
test_df['Inflation Rate^2'] = train_df['Inflation Rate'] ** 2
test_df['Federal Funds Target Upper Limit^2'] = test_df['Federal Funds Target Upper Limit'] ** 2
test_df['Federal Funds Target Lower Limit^2'] = test_df['Federal Funds Target Lower Limit'] ** 2
test_df['Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity^2'] = test_df['Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity'] ** 2
We previously mentioned that machine learning models are given data and told how incorrect they are and then make adjustments based on how incorrect they are. This is done primarily through two different concepts in data science. The first is loss which is a way of quantifying how incorrect the model's guesses are while training. The second is gradient descent. Gradient descent how we actually make the adjustments to the model. The actual math gets quite complicated so its usually better to think of it abstractly in the form of a bunch of hills. The height of these hills is how much loss or how incorrect the model currently is. The higher up you are the worse your model is, the lower the better it is. Gradient descent is how we make sure we move down these mountains by choosing the direction down the hill that is steepest and then taking a step. The number of "steps" we take down this mountain we call the number of epochs. How big each step is is determined by the learning rate.
For our model we use a slightly more advanced version of gradient descent called Ridge Regression. This special version is useful for something called overfitting. Overfitting is when a model is really really really good at correctly predicting the training data but has become to familiar with the unique qualities and nuances of that data set that it is extremely inaccurate with the test data and might as well be randomized. The L2 parameter is what we use as a penalty to the loss function to prevent overfitting.
By running our model we get a set of parameters, similar to $\beta_0$ and $\beta_1$ in the linear regression, which are what we use to draw predictions for the future.
Loss Functions: https://en.wikipedia.org/wiki/Loss_function
Gradient Descent: https://www.youtube.com/watch?v=IHZwWFHWa-w
General Machine Learning: https://youtu.be/hfMk-kjRv4c?si=yRxNHRKPqfjQufjF
# Ridge Regression Gradient descent algorithm
def grad_descent_ridge(X,y,T,alpha,L2):
m, n = X.shape
theta = np.zeros((n,1))
f = np.zeros(T)
div = 1 / (m)
for i in range(T):
pred = X.dot(theta)
error = pred - y
SSE = np.linalg.norm(error) ** 2
f[i] = div*(SSE + ((L2 / 2) * np.sum(np.square(theta[1:]))))
g = np.transpose(X).dot(error)
theta -= alpha*g
return(theta,f)
# Selecting training indicators and the index fund to train against
X2 = np.c_[np.ones((train_df.shape[0], 1)), train_df["Nominal Broad US Dollar Index"], train_df["Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity"], train_df["Nominal Broad US Dollar Index^2"], train_df["Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity^2"]]
y = np.c_[train_df["SPY"]]
# Specifying training parameters
epochs = 10
learning_rate = 0.000000000001
L2 = 1000000000
# Running Gradient Descent to get model parameters
theta, f = grad_descent_ridge(X2, y, epochs, learning_rate, L2)
print(theta, f)
[[1.66589002e-06] [1.90377256e-04] [1.01255325e-06] [2.18649695e-02] [3.33591899e-06]] [83322.43157162 49263.75748955 30081.59596966 19280.76432849 13201.2368875 9780.76462484 7857.49669347 6776.95357947 6170.53392839 5830.69638322]
# Plotting error as the model trains to get a visual of increasing accuracy
# Allows for tuning number of epochs by seeing diminishing returns
plt.plot(f)
plt.xlabel("Epochs")
plt.ylabel("MSE")
plt.title("MSE vs Training Epochs")
Text(0.5, 1.0, 'MSE vs Training Epochs')
# Selecting testing indicators and the index fund to test against
test_data = np.c_[np.ones((test_df.shape[0], 1)), test_df["Nominal Broad US Dollar Index"], test_df["Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity"], test_df["Nominal Broad US Dollar Index^2"], test_df["Treasury Constant Maturity Minus 3-Month Treasury Constant Maturity^2"]]
test_result = test_data.dot(theta)
print(test_result)
[[191.99904263] [187.57823023] [188.84854832] ... [331.84995706] [332.43635982] [332.67473369]]
To actually determine how accurate the model is we run the test data through the model and calculate the $R^2$ value between the predicted and actual values. By doing this we get an $R^2$ of roughly .5. While this may seem bad at first as it's no better than a coin flip you have to take statistical values into context. Here we do not have just two potential predictions we can make but 100s. If our model were no better than randomness we'd be flipping a 300+ sided coin with a random probability of .003 meaning our model is over 150 times better.
actual_spy = test_df['SPY'].values.reshape(-1, 1)
# Calculate R-squared
r_squared = r2_score(actual_spy, test_result)
print("R-squared:", r_squared)
R-squared: 0.5464503698413415